\section{{\Thething} privacy}
\label{sec:lmdk-prob}
{\Thething} privacy is based on differential privacy.
For this reason, we revisit the definition and important properties of differential privacy before moving on to the main ideas of this paper.
Although its local variant~\cite{duchi2013local} is more compatible with microdata, which is our use case, for the sake of simplicity we stick to the original version of differential privacy.
We refer the interested reader to~\cite{desfontaines2020sok} for a systematic taxonomy of the different variants and extensions of differential privacy, to~\cite{katsomallos2019privacy} for a survey of privacy models for continuous data publishing, and to~\cite{primault2018long} for an organization of the recent contributions in location privacy.
\subsection{Differential privacy}
\label{subsec:dp}
\emph{Differential privacy}~\cite{dwork2006calibrating} is a property of a privacy mechanism $\mathcal{M}$ processing a set of \emph{privacy-sensitive} personal data $D$,
%from a domain $\mathcal{D}$,
while providing quantifiable privacy and utility guarantees.
More specifically, $\mathcal{M}$ satisfies $\varepsilon$-differential privacy for a given `privacy budget' $\varepsilon \in \mathbb{R}^+$ if, for any output of $\mathcal{M}$, the ratio of the probabilities of $D$ and $D'$ being the true world is lower than or equal to $e^\varepsilon$, where $D'$ differs from $D$ in one tuple.
%cannot decide sure that a tuple exists in the database or not. for every pair of data sets $D, D' $
% \in \mathcal{D}$
%, as defined in Definition~\ref{def:dp}.
%\begin{definition}
% [Differential privacy~\cite{dwork2006calibrating}]
% \label{def:dp}
% A privacy mechanism $\mathcal{M}$ with domain $\mathcal{D}$ and range $\mathcal{O}$, satisfies $\varepsilon$-differential privacy, for a given privacy budget $\varepsilon \in \mathbb{R}+$, if for every pair of data sets $D, D' $
% % \in \mathcal{D}$
% differing in one tuple and all sets $O$ it holds that:
% %\subseteq \mathcal{O}$:
% $$\Pr[\mathcal{M}(D) \in O] \leq e^\varepsilon \Pr[\mathcal{M}(D') \in O]$$
%\end{definition}
%
% \kat{add exponential, maybe put references only}
A widely used privacy mechanism is the \emph{Laplace mechanism}~\cite{dwork2014algorithmic}, which randomly draws a value from the probability distribution $\textrm{Laplace}(\mu, \frac{\Delta f}{\varepsilon})$.
Here, $\mu$ stands for the original output of the query associated with $\mathcal{M}$, $\Delta f$ is the sensitivity of that query, and $\frac{\Delta f}{\varepsilon}$ is the scale of the distribution.
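As a brief illustration under assumed values (chosen for exposition only, and writing $b$ as shorthand for the scale), consider a count query with sensitivity $\Delta f = 1$ and a privacy budget $\varepsilon = 0.5$.
The scale is then $b = \frac{\Delta f}{\varepsilon} = 2$, and the mechanism releases a value $x$ drawn from the density
$$ p(x \mid \mu, b) = \frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right) = \frac{1}{4} \exp\left(-\frac{|x - \mu|}{2}\right), $$
which has variance $2b^2 = 8$.
Halving the budget to $\varepsilon = 0.25$ doubles the scale to $b = 4$ and quadruples the variance to $32$, which shows how a smaller budget buys stronger privacy at the cost of utility.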
% \kat{do we need to know the details of the mechanisms? I would rather put an intuitive description, or in which cases each is preferred.}
% \mk{Most probably we'll need the exponential and Geo-I.}
%A typical example of differential privacy mechanism is the \emph{Laplace mechanism}~\cite{dwork2014algorithmic}.
%It draws randomly a value from the probability distribution of $\textrm{Laplace}(\mu, b)$, where $\mu$ stands for the location parameter and $b > 0$ the scale parameter.
%Here, $\mu$ is equal to the original output value of a query function, and $b$ is the sensitivity of the query function divided by $\varepsilon$.
%The Laplace mechanism works for any function with range the set of real numbers.
A specialization of this mechanism for location data is the \emph{Planar Laplace mechanism}~\cite{andres2013geo}, which is based on a multivariate Laplace distribution and offers a level of protection equal to $\varepsilon$ times a desired protection radius.
Mechanisms that satisfy differential privacy are \emph{composable}.
%, i.e.,~the combination of their results satisfies differential privacy as well.
The \emph{sequential} composition of $\mathcal{M}_1(D)$, $\mathcal{M}_2(D)$ with $\varepsilon_1$, $\varepsilon_2$
%applied on the same $D$
results in $\mathcal{M}(D)$ with $\varepsilon = \varepsilon_1 + \varepsilon_2$.
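For instance, assume two hypothetical releases over the same $D$ with $\varepsilon_1 = 0.3$ and $\varepsilon_2 = 0.7$, and let $O_1$, $O_2$ denote arbitrary sets of outputs (notation introduced here only for this example).
When the two mechanisms use independent randomness, the joint probability factorizes and the individual bounds multiply:
$$ \frac{\Pr[\mathcal{M}_1(D) \in O_1, \mathcal{M}_2(D) \in O_2]}{\Pr[\mathcal{M}_1(D') \in O_1, \mathcal{M}_2(D') \in O_2]} \leq e^{\varepsilon_1} e^{\varepsilon_2} = e^{\varepsilon_1 + \varepsilon_2} = e^{1} \approx 2.72 $$
Every additional release on the same data therefore consumes part of the overall privacy budget.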
% The \emph{parallel} composition of $\mathcal{M}_1(D_1),\mathcal{M}_2(D_2)$ with $\varepsilon_1,\varepsilon_2$ results in $\mathcal{M}(D_1\cup D_2)$ with $\varepsilon{=}max(\varepsilon_1,\varepsilon_2)$. The post-processing of the results of a differential private mechanism does not deteriorate the entailed privacy.
% \mk{We don't need it now}
%
%\kat{integrate with next subsection}
%The presence of temporal correlations might result into additional privacy loss due to data releases performed in previous -- \emph{backward privacy loss} $\alpha^B$ -- and subsequent --\emph{forward privacy loss} $\alpha^F$~\cite{cao2017quantifying} -- timestamps.\kat{review -- complete further if space}
%Cao et al.~\cite{cao2017quantifying} propose a method for computing the total temporal privacy loss (TPL) in the presence of temporal correlations and background knowledge. Due to the lack of space, we refer the interested reader to the original publication for the complete definitions and formulas.
\subsection{Problem description and definition}
\label{subsec:prob-set}
%\kat{move flowchart here}
%Our problem setting consists of three entities: (i) data generators (users), (ii) data publishers (trusted non-adversarial entities), and (iii) data consumers (possibly adversarial entities).
Users generate sensitive data, which are processed in a secure and private way by a trusted curator and are later published in order to be consumed by potentially adversarial data analysts.
%The data unit produced by the users is an \emph{event}, i.e., a piece of timestamped user-related information.\kat{should we say geo-stamped?}.
Data are produced as a series of events, which we call a time series.
An \emph{event} is defined as a triple consisting of an identifying attribute of an individual, the possibly sensitive data, and a timestamp.
%This workflow is repeated in a continuous manner, producing series of events, which we call time series.
%, producing, processing, publishing, and consuming events in a private manner.
%\kat{keep only the terms with a small description.}
%\begin{enumerate}[(i)]
%
% \item \textbf{Data generators} (users) entity $E_g$ interacts with a crowdsensing application and produces continuously privacy-sensitive data items in an arbitrary frequency during the application's usage period $T = (t)_{t \in \mathbb{N}}$.
% Thus, at each timestamp $t$, $E_g$ generates a data set $D_t \in \mathcal{D}$ where each of its members contributes a single data item.
%
% \item \textbf{Data publishers} (trusted non-adversarial) entity $E_p$ receives the data sent by $E_g$ in the form of a series of events in $T$.
% Following the \emph{global} processing and publishing scheme, $E_p$ collects at $t$ a data set $D_t$ and privacy-protects it by applying the respective privacy mechanism $\mathcal{M}_t$.
% $\mathcal{M}_t$ uses independent randomness such that it satisfies $\varepsilon_t$-differential privacy.
%
% \item \textbf{Data consumers} (possibly adversarial) entity $E_c$ receives the result $\mathbf{o}_t$ of the privacy-preserving processing of $D_t$ by $E_p$.
% According to Theorem~\ref{theor:compo-seq-ind}, the overall privacy guarantee of the outputs of $\mathcal{M}$ is equal to the sum of all the privacy budgets of the respective privacy mechanisms that compose $\mathcal{M}$, i.e.,~$\sum_{t \in T}\varepsilon_t$.
%
%\end{enumerate}
%
%We assume that all the interactions between $E_g$ and $E_p$ are secure and private, and thus $E_p$ is considered trusted and non-adversarial by $E_g$.
%Notice that, in a real life scenario, $E_g$ and $E_c$ might overlap with each other, i.e.,~data producers might be data consumers as well.
%
%
%\subsection{Privacy goal}
%\label{subsec:prv-g}
We argue that in continuous user-generated data publishing, events are not equally `significant' in terms of privacy.
% (including contextual information).
%It can be seen as a correspondence to a record in a database, where each individual may participate once.
We term a significant event---according to user- or data-related criteria---as a \emph{\thething}~event.
% As mentioned in Section~\ref{sec:intro}, t
The identification of {\thething} events can be performed manually or automatically~\cite{zhou2004discovering, hariharan2004project}, and is orthogonal to this work.
%We defer the study of the {\thethings} discovery to a following work.
In this work, we consider the {\thething} timestamps non-sensitive and provided by the user as input along with the privacy budget $\varepsilon$.
% \kat{check that this is mentioned in the intro}
For example, events $p_1$, $p_3$, $p_5$, $p_8$ in Figure~\ref{fig:lmdk-scenario} are {\thething} events.
% relevant to certain user-defined privacy criteria, or to its adjacent data item(s) as well as to the entire data set or parts thereof.
% A significant event or item signals its consequence to us, toward us.
% https://www.quora.com/What-is-the-difference-between-significant-and-important
%\begin{definition}
% [{\Thething} event]
% \label{def:thething-evnt}
% A {\thething} event is a significant---according to user- or data-related criteria---user-generated data item.
%\end{definition}
%In this scenario, these are $1$, $3$, $5$, $8$ and they fall in areas in a dark shade.
%\kat{we must define a series of events before neighbouring series of events}
%\begin{definition}
% [{\Thething} neighboring time series]
% \label{def:thething-nb}
% Two time series are {\thething} neighboring (or adjacent) when they differ by a single {\thething} event.
%% i.e.,~one can be obtained by adding/removing a {\thething} to/from the other.
%\end{definition}
Two time series of equal lengths are \emph{{\thething} neighboring} when they differ by a single {\thething} event.
For example, the time series ($p_1$, \dots, $p_8$) with the set of {\thethings} $\{p_1, p_3, p_5\}$ is {\thething} neighboring to the time series of Figure~\ref{fig:lmdk-scenario}.
%This means that we can obtain the first time series by adding/removing one event to/from the second time series.
%to/from any one of two {\thething} neighboring series of events we can obtain the other series.
% Therefore, Corollary~\ref{cor:thething-nb} follows.
We proceed to propose \emph{{\thething} privacy}, a configurable variation of differential privacy for time series (Definition~\ref{def:thething-prv}).
%\begin{corollary}
% \label{cor:thething-nb}
% Two {\thething} neighboring series of events are event neighboring as well.
%\end{corollary}
%\kat{what is event neighboring?}
%\kat{Up to now M was a mechanism, now it is a set of mechanisms?}
\begin{definition}
% [{\Thething} privacy]
\label{def:thething-prv}
Let $\mathcal{M}$ be a privacy mechanism with range $\mathcal{O}$ and domain $\mathcal{S}_T$ being the set of all time series with length $|T|$, where $T$ is a sequence of timestamps.
$\mathcal{M}$ satisfies {\thething} $\varepsilon$-differential privacy (or, simply, {\thething} privacy) if for all sets $O \subseteq \mathcal{O}$, and for every pair of {\thething}-neighboring time series $S_T$, $S_T'$,
% and all $T = (t)_{t \in \mathbb{N}}$,
it holds that
  $$\Pr[\mathcal{M}(S_T) \in O] \leq e^\varepsilon \Pr[\mathcal{M}(S_T') \in O]$$
\end{definition}
% \kat{to rephrase for an easier transition -- mention here user and event level that satisfy {\thething} privacy and add discussion that we can do better and propose the new mechanism}
As discussed in Section~\ref{subsec:prv-levels}, user-level privacy can achieve {\thething} privacy, but it over-perturbs the final data because it does not distinguish between {\thething} and regular events.
Theorem~\ref{theor:thething-prv} shows how to achieve the desired privacy for the {\thethings} (i.e.,~a total budget of at most $\varepsilon$) and, at the same time, provide better overall data quality.
% the existing protection levels of differential privacy do not provide adequate control in time series publishing.
%In Figure~\ref{fig:st-cont} we exhibited how additional user preferences can impact the necessary privacy/utility tradeoff.
%We introduce the notion of {\thethings} and propose {\thething} privacy, a configurable variation of differential privacy for time series.
%By taking into account {\thethings}, i.e.,~timestamps were significant events take place, {\thething} privacy can provide a satisfying protection level while not perturbing the original time series unnecessarily.
%In mereology, the formal study on the relation between parts and the entities they form, it is generally held that the identity of an observable object depends on its \emph{spatiotemporal continuity}~\cite{wiggins1967identity, scaltsas1981identity, hazarika2001qualitative}: the property of well-behaved objects that alter their state in harmony with space and time.
%Considering events that span the entirety of the user-generated time series ensures the spatiotemporal continuity of the users.
%This way, it is possible to acquire more information regarding individuals' identities, and design privacy preserving methods that offer improved privacy and utility guarantees.
\begin{theorem}
% [{\Thething} privacy]
\label{theor:thething-prv}
% A privacy mechanism that protects any timestamp all the {\thething} events in a time series, satisfies {\thething} privacy.
  Let $\mathcal{M}$ be a mechanism with input a time series $S_T$, where $T$ is the set of the involved timestamps, and $L \subseteq T$ be the set of {\thething} timestamps.
  $\mathcal{M}$ is decomposed into $\varepsilon_t$-differentially private sub-mechanisms $\mathcal{M}_t$, for every $t \in T$, that apply independent randomness to the data item at $t$.
  Then, given a privacy budget $\varepsilon$, $\mathcal{M}$ satisfies {\thething} privacy if for every $t \in T$ it holds that
$$ \sum_{i\in L \cup \{t\}} \varepsilon_i \leq \varepsilon$$
\end{theorem}
% \mk{To discuss.}
Due to space constraints, we omit the proof of Theorem~\ref{theor:thething-prv} and defer it to a longer version of this paper.
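To illustrate the condition of Theorem~\ref{theor:thething-prv}, consider the time series of Figure~\ref{fig:lmdk-scenario} and assume, as in the earlier example, that $T = \{1, \dots, 8\}$ and $L = \{1, 3, 5, 8\}$.
For a regular timestamp such as $t = 2$, the condition reads
$$ \varepsilon_1 + \varepsilon_2 + \varepsilon_3 + \varepsilon_5 + \varepsilon_8 \leq \varepsilon, $$
whereas for a {\thething} timestamp such as $t = 3$, where $L \cup \{t\} = L$, it reduces to
$$ \varepsilon_1 + \varepsilon_3 + \varepsilon_5 + \varepsilon_8 \leq \varepsilon. $$
In contrast, under sequential composition user-level protection bounds the sum over all eight timestamps, $\sum_{i \in T} \varepsilon_i \leq \varepsilon$, which leaves a smaller share of the budget for each release.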
\subsubsection{{\Thething} privacy mechanisms}
\label{subsec:lmdk-mechs}
% \kat{add the two models -- uniform and dynamic and skip}
%\kat{isn't the uniform distribution a method? there is a section for the methods. }
Figure~\ref{fig:st-cont} shows the simplest model that implements Theorem~\ref{theor:thething-prv}, the \textbf{Uniform} distribution of privacy budget $\varepsilon$ for {\thething} privacy.
% \mk{We capitalize the first letter because it's the name of the method.}
% in comparison with user-level protection.
In this case, it suffices to assign to each timestamp the total privacy budget divided by the number of {\thething} timestamps, incremented by one when the timestamp being released is a regular one.
Consequently, at each timestamp we protect every {\thething}, while reserving a part of $\varepsilon$ for the current timestamp.
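As a rough numerical sketch, assume $|T| = 8$, $|L| = 4$, and $\varepsilon = 1$ (illustrative values that mirror Figure~\ref{fig:lmdk-scenario}, not experimental settings).
Under the conservative reading in which every timestamp receives the same share, we have
$$ \varepsilon_i = \frac{\varepsilon}{|L| + 1} = \frac{1}{5} = 0.2 \quad \textrm{for every } i \in T. $$
For a regular timestamp $t$, the sum over $L \cup \{t\}$ is $5 \cdot 0.2 = 1 = \varepsilon$, and for a {\thething} timestamp it is $4 \cdot 0.2 = 0.8 \leq \varepsilon$, so the condition of Theorem~\ref{theor:thething-prv} holds in both cases.
For comparison, a uniform user-level allocation would assign only $\frac{\varepsilon}{|T|} = 0.125$ to each timestamp, i.e.,~a Laplace scale that is $1.6$ times larger for the same sensitivity.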
%In this case, distributing $\frac{\varepsilon}{5}$ can guarantee {\thething} privacy.
% \begin{figure}[htp]
% \centering
% \includegraphics[width=0.9\linewidth]{thething-prv}
% \caption{Uniform application scenario of {\thething} privacy.}
% \label{fig:thething-prv}
% \end{figure}
Next, we propose an \textbf{Adaptive} privacy mechanism taking into account changes in the input data and exploiting the post-processing property of differential privacy.
Initially, it uniformly reserves the available privacy budget for each future release.
At each timestamp, based on a sampling rate, the mechanism either publishes the original data with noise or releases an approximation based on previous releases.
When it publishes the original data with noise, it also calculates the difference between the current and the previous release and compares it with the scale of the perturbation ($\frac{\Delta f}{\varepsilon}$).
The outcome of this comparison determines how the sampling rate is adapted for the next events:
if the scale is greater, the input has not changed much, and therefore the mechanism decreases the sampling rate.
In the case when the mechanism approximates a {\thething} (but not a regular timestamp), it distributes the reserved privacy budget
% divided by the number of remaining {\thething} plus one
to the next timestamps.
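The comparison step can be sketched as follows, under assumptions that we state explicitly: the difference is taken in absolute value, $t^{*}$ denotes the timestamp of the previous release, and $\varepsilon_t$ is the budget spent at $t$ (this notation is introduced here only for illustration).
The sampling rate is decreased whenever
$$ |\pmb{o}_t - \pmb{o}_{t^{*}}| < \frac{\Delta f}{\varepsilon_t}. $$
For example, with $\Delta f = 1$ and $\varepsilon_t = 0.2$ the scale is $5$; a difference of $3$ between consecutive releases falls below the scale, so the change is indistinguishable from the expected noise and the mechanism samples less often.
A difference of $8$ would exceed the scale; a natural assumption is that the sampling rate is then not decreased, and possibly increased.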
% Why skipping publications is problematic?
One might argue that we could \textbf{Skip} the \thething\ data releases.
% and limit the number of {\thethings}.
This would preserve all of the available privacy budget for regular events (because the set $L \cup \{t\}$ becomes $\{t\}$), which is equivalent to event-level protection.
In practice, however, this approach can eventually pose arbitrary privacy risks, especially when dealing with geotagged data.
In particular, sporadic location data publishing~\cite{gambs2010show, russell2018fitness} or misapplied location cloaking~\cite{xssfopes2020tweet} could result in areas with sparse data points, which indicate privacy-sensitive locations.
% \mk{WIP}
% \kat{write in text and remove the algorithm}
% \begin{algorithm}
% \caption{Adaptive {\thething} privacy mechanism}
% \label{algo:adapt-lmdk-priv}
% \SetKwInput{KwData}{Input}
% \SetKwInput{KwResult}{Output}
% \SetKwData{diffCur}{diffCur}
% \SetKwData{diffMin}{diffMin}
% \SetKwData{evalCur}{evalCur}
% \SetKwData{evalOrig}{evalOrig}
% \SetKwData{evalSum}{evalSum}
% \SetKwData{metricCur}{metricCur}
% \SetKwData{metricOrig}{metricOrig}
% \SetKwData{opt}{opt}
% \SetKwData{opti}{opt$_i$}
% \SetKwData{optim}{optim}
% \SetKwData{optimi}{optim$_i$}
% \SetKwData{opts}{opts}
% \SetKwData{reg}{reg}
% \SetKwData{S}{$S_T$}
% \SetKwData{L}{$L$}
% \SetKwData{epsilon}{$\varepsilon$}
% \SetKwFunction{calcMetric}{calcMetric}
% \SetKwFunction{evalSeq}{evalSeq}
% \SetKwFunction{getCombs}{getCombs}
% \SetKwFunction{getOpts}{getOpts}
% \DontPrintSemicolon
% \KwData{\S, \L, \epsilon}
% \KwResult{\optim}
% \BlankLine
% % \If{abs($$)}
% % \If{$i \in L$}{
% % \lmdks $\leftarrow$ \lmdks + 1
% % \ForEach{$j \in [i + 1, T]$}{
% % $varepsilon_j \leftarrow varepsilon_j + \frac{\varepsilon_i}{|T| - \lmdks + 1}$
% % }
% % }
% % Evaluate the original
% \metricOrig $\leftarrow$ \calcMetric{$\{t_n\}, \emptyset, \{l_k\}$}\;
% \evalOrig $\leftarrow$ \evalSeq{\metricOrig}\;
% % Get all possible option combinations
% \opts $\leftarrow$ \getOpts{$\{t_n\}, \{l_k\}$}\;
% % Track the minimum (best) evaluation
% \diffMin $\leftarrow$ $\infty$\;
% % Track the optimal sequence (the one with the best evaluation)
% \optim $\leftarrow$ $[]$\;
% \ForEach{\opt $\in$ \opts}{\label{algo:lmdk-sel-opt-for-each}
% \evalSum $\leftarrow 0$\;
% \ForEach{\opti $\in$ \opt}{
% \metricCur $\leftarrow$ \calcMetric{$\{t_n\}, \opti, \{l_k\}$}\;\label{algo:lmdk-sel-opt-comparison}
% \evalSum $\leftarrow$ \evalSum $+$ \evalSeq{\metricCur}\;
% % Compare with current optimal
% \diffCur $\leftarrow \left|\evalSum/\#\opt - \evalOrig\right|$\;
% \If{\diffCur $<$ \diffMin}{
% \diffMin $\leftarrow$ \diffCur\;
% \optim $\leftarrow$ \opt\;
% }
% }
% }\label{algo:lmdk-sel-opt-end}
% \Return{\optim}
% \end{algorithm}
\subsubsection{{\Thething} privacy under temporal correlation}
\label{subsec:correlations}
From the discussion so far, it is evident that for the budget distribution it is not the positions but rather the number of the {\thethings} that matters.
However, this is not the case in the presence of temporal correlation, which is inherent in continuously generated data.
% HMMs have two important independence properties:
% Markov hidden process: future depends on past via the present.
% Current observation independent of all else given current state.
% Intuitively, D^t or D^{t+1} "cuts off" the propagation of the Markov chain.
The Hidden Markov Model~\cite{baum1966statistical} stipulates two important independence properties: (i)~the future (past) depends on the past (future) only via the present, and (ii)~the current observation is independent of everything else given the current state.
%Thus, the observation of a data release at a timestamp $t$ depends only on the respective input data set $D_t$, i.e.,~the current state.
Hence, an observation at a specific timestamp is independent of the previous and next data sets given the current input data set.
Intuitively, knowing the data set at timestamp $t$ stops the propagation of the Markov chain towards the next or previous timestamps.
%\kat{do we see this in the formula 1 ?}
%when calculating the forward or backward privacy loss respectively.
Cao et al.~\cite{cao2017quantifying} propose a method for computing the total temporal privacy loss $\alpha_t$ at a timestamp $t$ as the sum of the backward and forward privacy loss, $\alpha^B_t$ and $\alpha^F_t$, minus the privacy budget $\varepsilon_t$
to account for the extra privacy loss due to previous and next releases $\pmb{o}$ of $\mathcal{M}$ under temporal correlation.
By Theorem~\ref{theor:thething-prv}, at every timestamp $t$ we consider the data at $t$ and at the {\thething} timestamps $L$.
%According to the Definitions~{\ref{def:bpl} and \ref{def:fpl}}, we calculate the backward and forward privacy loss by taking into account the privacy budget at previous and next data releases respectively.
When sequentially composing the data releases for each timestamp $i$ in $L \cup \{t\}$ we
%calculate the temporal privacy loss $\alpha_t$ at each timestamp $t \in L \cup \{i\}$ by
%consider the previous and next data releases at the timestamps $i^{-}, i^{+} \in L \cup \{t\} \setminus \{i\}$ respectively.
consider the previous releases in the whole time series until the timestamp $i^{-}$ that is exactly before $i$ in the ordered $L {\cup} \{t\}$, and the next data releases in the whole time series until the timestamp $ i^{+}$ that is exactly after $i$ in the ordered $L {\cup }\{t\} $.
%\kat{not sure I understand}
%Thus, we calculate the backward/forward privacy loss by taking into account the data releases after/before the previous/next data item.
That is:
% \dk{do we keep looking at all Landmarks both for backward and forward? I would assume that for backward we are looking to the Landmarks until the i and for the forward to the Landmarks after the i - if we would like to be consistent with Cao. Otherwise the writing here is confusing.}
% \mk{We are discussing about the case where we calculate the tpl at each timestamp i in L+{t}. Therefore, bpl at i is calculated until i- and fpl at i until i+.}
\begin{align}
\adjustbox{max width=0.9\linewidth}{
$\alpha_i =
\underbrace{\ln \frac{\Pr[\pmb{o}_{i^{-} + 1}, \dots, \pmb{o}_i | D_i]}{\Pr[\pmb{o}_{i^{-} + 1}, \dots, \pmb{o}_i | D'_i]}}_{\alpha^B_i} +
\underbrace{\ln \frac{\Pr[\pmb{o}_{i^{+} - 1}, \dots, \pmb{o}_i | D_i]}{\Pr[\pmb{o}_{i^{+} - 1}, \dots, \pmb{o}_i | D'_i]}}_{\alpha^F_i} -
\underbrace{\ln \frac{\Pr[\pmb{o}_i | D_i]}{\Pr[\pmb{o}_i | D'_i]}}_{\varepsilon_i}$
}
\end{align}
Finally, $\alpha_t$ is equal to the sum of all $\alpha_i$ for $i \in L \cup \{t\}$.
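To make the indices concrete, assume $T = \{1, \dots, 8\}$, $L = \{1, 3, 5, 8\}$, and a regular timestamp $t = 6$ (illustrative values), so that the ordered $L \cup \{t\}$ is $(1, 3, 5, 6, 8)$.
For $i = 5$ we have $i^{-} = 3$ and $i^{+} = 6$; hence, $\alpha^B_5$ is computed over the releases $\pmb{o}_4, \pmb{o}_5$ and $\alpha^F_5$ over $\pmb{o}_5$ alone.
For $i = 6$ we have $i^{-} = 5$ and $i^{+} = 8$; hence, $\alpha^B_6$ is computed over $\pmb{o}_6$ alone and $\alpha^F_6$ over $\pmb{o}_6, \pmb{o}_7$.
The total temporal privacy loss at $t = 6$ is then
$$ \sum_{i \in \{1, 3, 5, 6, 8\}} \alpha_i. $$
The first and last elements of the ordered set require a boundary convention; one possible choice, which is an assumption here, is $i^{-} = 0$ for the first element and $i^{+} = \max(T) + 1$ for the last.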
%
% where $x_t$ (or $x'_t$) is the potential (neighboring) data item of an individual who is targeted by an adversary with knowledge $\mathbb{D}_t$.
%where $D_t$ and $D'_t$ are the neighboring input data sets (Definition~\ref{def:nb-d-s}) responsible for the output $\pmb{o}_t$.
%Notice that if $t$ is the first or last item in $L \cup \{i\}$ then we need to set $t_{\text{prv}} = 0$ or $t_{\text{nxt}} = \max(T) + 1$.
%In Section~\ref{sec:eval}, we experimentally show how the distribution of {\thethings} impacts the overall privacy loss of the user.