diff --git a/graphics/laplace.pdf b/graphics/laplace.pdf index 5cd77cf..e12acc8 100644 Binary files a/graphics/laplace.pdf and b/graphics/laplace.pdf differ diff --git a/text/preliminaries/privacy.tex b/text/preliminaries/privacy.tex index 0c74765..03b98a3 100644 --- a/text/preliminaries/privacy.tex +++ b/text/preliminaries/privacy.tex @@ -20,24 +20,24 @@ Attribute disclosure appears when it is revealed from (a privacy-protected versi \subsection{Levels} -\label{subsec:privacy-levels} +\label{subsec:prv-levels} -The information disclosure that a data release may entail is often linked to the protection level that a privacy-preserving algorithm is trying to achieve. -More specifically, in continuous data publishing the privacy protection level is considered with respect to not only the users but also to the \emph{events} occurring in the data. -An event is considered as a pair of an identifying attribute of an individual and the sensitive data (including contextual information), and can be seen as a correspondence to a record in a database, where each individual may participate once. -Data publishers typically release events in the form of data points' sequences usually indexed in time order (time series), and geotagged, e.g.,~(`Dewey', `at home at Montmartre at $t_1$'), \dots, (`Quackmore', `dining at Opera at $t_1$'). -The term `users' is used to refer to the \emph{individuals}, also known as \emph{participants}, who are the source of the processed and published data. +The information disclosure that a data release may entail is linked to the protection level that indicates \emph{what} a privacy-preserving algorithm is trying to achieve. +More specifically, in continuous data publishing we consider the privacy protection level with respect to not only the users but also to the \emph{events} occurring in the data. 
+An event is a pair of an identifying attribute of an individual and the sensitive data (including contextual information) and can be seen as corresponding to a record in a database, where each individual may participate once. +Data publishers typically release events in the form of sequences of data items, usually indexed in time order (time series) and geotagged, e.g.,~(`Dewey', `at home at Montmartre at $t_1$'), \dots, (`Quackmore', `dining at Opera at $t_1$'). +We use the term `users' to refer to the \emph{individuals}, also known as \emph{participants}, who are the source of the processed and published data. Therefore, they should not be confused with the consumers of the released data sets. Users are subject to privacy attacks, and thus are the main point of interest of privacy protection mechanisms. In more detail, the privacy protection levels are: -\begin{itemize} - \item \emph{Event}~\cite{dwork2010differential, dwork2010pan}---\emph{any single event} of any individual is protected. - \item \emph{User}~\cite{dwork2010differential, dwork2010pan}---\emph{all the events} of any individual, spanning the observed event sequence, are protected. - \item \emph{$w$-event}~\cite{kellaris2014differentially}---\emph{any sequence of $w$ events}, within the released series of events, of any individual is protected. -\end{itemize} +\begin{enumerate}[(a)] + \item \emph{Event}~\cite{dwork2010differential, dwork2010pan}---limits the privacy protection to \emph{any single event} in a time series, providing maximum data utility. + \item \emph{$w$-event}~\cite{kellaris2014differentially}---provides privacy protection to \emph{any sequence of $w$ events} in a time series. + \item \emph{User}~\cite{dwork2010differential, dwork2010pan}---protects \emph{all the events} in a time series, providing maximum privacy protection. 
+\end{enumerate} -Figure~\ref{fig:privacy-levels} demonstrates the application of the possible protection levels on the statistical data of Example~\ref{ex:continuous}. +Figure~\ref{fig:prv-levels} demonstrates the application of the possible protection levels on the statistical data of Example~\ref{ex:continuous}. For instance, in event-level (Figure~\ref{fig:level-event}) it is hard to determine whether Quackmore was dining at Opera at $t_1$. Moreover, in user-level (Figure~\ref{fig:level-user}) it is hard to determine whether Quackmore was ever included in the released series of events at all. Finally, in $2$-event-level (Figure~\ref{fig:level-w-event}) it is hard to determine whether Quackmore was ever included in the released series of events between the timestamps $t_1$ and $t_2$, $t_2$ and $t_3$, etc. (i.e.,~for a window $w = 2$). @@ -54,18 +54,18 @@ Finally, in $2$-event-level (Figure~\ref{fig:level-w-event}) it is hard to deter \includegraphics[width=.32\linewidth]{level-w-event}% }\hspace{\fill} \caption{Protecting the data of Table~\ref{tab:continuous-statistical} on (a)~event-, (b)~user-, and (c)~$2$-event-level. A suitable distortion method can be applied accordingly.} - \label{fig:privacy-levels} + \label{fig:prv-levels} \end{figure} -Contrary to event-level that provides privacy guarantees for a single event, user- and $w$-event-level offer stronger privacy protection by protecting a series of events. -In use-cases that involve infinite data, event- and $w$-event-level attain an adequate balance between data utility and user privacy, whereas user-level is more appropriate when the span of data observation is predefined. +Contrary to event-level, which provides privacy guarantees for a single event, user- and $w$-event-level offer stronger privacy protection by protecting a series of events. 
+Event- and $w$-event-level are better suited to scenarios of infinite data observation, whereas user-level is more appropriate when the span of data observation is finite. $w$-event- is narrower than user-level protection due to its sliding window processing methodology. -In the extreme cases where $w$ is set to either $1$ or to the size of the entire length of the series of events, $w$-event- matches event- or user-level protection, respectively. -Although the described levels have been coined in the context of \emph{differential privacy}~\cite{dwork2006calibrating}, a seminal privacy method that we will discuss in more detail in Section~\ref{subsec:privacy-statistical}, it is possible to apply their definitions to other privacy protection techniques as well. +In the extreme cases where $w$ is equal to either $1$ or to the entire length of the time series, $w$-event- matches event- or user-level protection, respectively. +Although the described levels have been coined in the context of \emph{differential privacy}~\cite{dwork2006calibrating}, a seminal privacy method that we will discuss in more detail in Section~\ref{subsec:prv-statistical}, it is possible to apply their definitions to other privacy protection techniques as well. \subsection{Attacks} -\label{subsec:privacy-attacks} +\label{subsec:prv-attacks} Information disclosure is typically achieved by combining supplementary (background) knowledge with the released data or by setting unrealistic assumptions while designing the privacy-preserving algorithms. In its general form, this is known as an \emph{adversarial} or \emph{linkage} attack. 
@@ -121,7 +121,7 @@ Even though many works directly refer to the general category of linkage attacks \end{itemize} -The first sub-category of attacks has been mainly addressed in works on snapshot microdata publishing, and is still present in continuous publishing; however, algorithms for continuous publishing typically accept the proposed solutions for the snapshot publishing scheme (see discussion over $k$-anonymity and $l$-diversity in Section~\ref{subsec:privacy-seminal}). +The first sub-category of attacks has been mainly addressed in works on snapshot microdata publishing, and is still present in continuous publishing; however, algorithms for continuous publishing typically adopt the solutions proposed for the snapshot publishing scheme (see the discussion of $k$-anonymity and $l$-diversity in Section~\ref{subsec:prv-seminal}). This kind of attack is tightly coupled with publishing the (privacy-protected) sensitive attribute value. An example is the lack of diversity in the sensitive attribute domain, e.g.,~if all users in the data set of Table~\ref{tab:snapshot-micro} shared the same \emph{running} Status (the sensitive attribute). The second and third subcategories are attacks emerging (mostly) in continuous publishing scenarios. @@ -132,7 +132,7 @@ In order to better protect the privacy of Donald in case of attacks, the data sh \subsection{Operations} -\label{subsec:privacy-operations} +\label{subsec:prv-operations} Protecting private information, which is known by many names (obfuscation, cloaking, anonymization, etc.), is achieved by using a specific basic privacy protection operation. 
Depending on the intervention that we choose to perform on the original data, we identify the following operations: @@ -157,13 +157,13 @@ For these reasons, there will be no further discussion around this family of tec \subsection{Seminal works} -\label{subsec:privacy-seminal} +\label{subsec:prv-seminal} For completeness, in this section we present the seminal works for privacy-preserving data publishing, which, even though originally designed for the snapshot publishing scenario, have paved the way, since many of the works in privacy-preserving continuous publishing are based on or extend them. \subsubsection{Microdata} -\label{subsec:privacy-micro} +\label{subsec:prv-micro} Sweeney coined \emph{$k$-anonymity}~\cite{sweeney2002k}, one of the first established works on data privacy. A released data set features $k$-anonymity protection when the sequence of values for a set of identifying attributes, called the \emph{quasi-identifiers}, is the same for at least $k$ records in the data set. @@ -186,53 +186,81 @@ Proposed solutions include rearranging the attributes, setting the whole attribu \subsubsection{Statistical data} -\label{subsec:privacy-statistical} +\label{subsec:prv-statistical} While methods based on $k$-anonymity have been mainly employed for releasing microdata, \emph{differential privacy}~\cite{dwork2006calibrating} has been proposed for releasing high utility aggregates over microdata while providing semantic privacy guarantees. Differential privacy is algorithmic: it ensures that any adversary observing a privacy-protected output, no matter his/her computational power or auxiliary information, cannot conclude with absolute certainty whether an individual is included in the input data set. Moreover, it quantifies and bounds the impact that the addition/removal of the data of an individual to/from an input data set has on the derived privacy-protected aggregates. 
-In its formal definition, a \emph{privacy mechanism} $\mathcal{M}$, which outputs a query answer with some injected randomness, satisfies $\varepsilon$-differential privacy for a user-defined privacy budget $\varepsilon$~\cite{mcsherry2009privacy} if for all pairs of \emph{neighboring} (i.e.,~differing by the data of an individual) data sets $D$ and $D'$, it holds that: -$$\Pr[\mathcal{M}(D) \in O]\leq e^\varepsilon \Pr[\mathcal{M}(D') \in O],$$ +\begin{definition} + [Neighboring data sets] + \label{def:nb-d-s} + Two data sets are neighboring (or adjacent) when they differ by at most one tuple, i.e.,~one can be obtained by adding/removing the data of an individual to/from the other. +\end{definition} -\noindent where $\Pr[\cdot]$ denotes the probability of an event, and $O$ is the world of possible outputs of a mechanism $\mathcal{M}$. -As the definition implies, for low values of $\varepsilon$, $\mathcal{M}$ achieves stronger privacy protection since the probabilities of $D$ and $D'$ being true worlds are similar, but the utility of the mechanism's output is reduced since more randomness is introduced. -The privacy budget $\varepsilon$ has a non-zero and positive value, and is usually set to $0.01$, $0.1$, or, in some cases, $\ln2$ or $\ln3$~\cite{lee2011much}. +More precisely, differential privacy quantifies the impact of the addition/removal of a single tuple in $D$ on the output $\pmb{o}$ of $\mathcal{M}$. +The distribution of all $\pmb{o}$, in some range $\mathcal{O}$, is not affected \emph{substantially}, i.e.,~it changes only slightly due to the modification of any one tuple in all possible $D \in \mathcal{D}$. +Thus, differential privacy is algorithmic: it ensures that any adversary observing any $\pmb{o}$ cannot conclude with absolute certainty whether or not any individual is included in any $D$. +Its guarantee holds regardless of the computational power and auxiliary information available to an adversary observing the outputs of $\mathcal{M}$. 
-A typical mechanism example is the \emph{Laplace mechanism}~\cite{dwork2014algorithmic}, which draws randomly a value from the probability distribution of $\textrm{Laplace}(\mu, b)$, where $\mu$ stands for the location parameter and $b > 0$ the scale parameter. +\begin{definition} + [Differential privacy] + \label{def:dp} + A privacy mechanism $\mathcal{M}$, with domain $\mathcal{D}$ and range $\mathcal{O}$, satisfies $\varepsilon$-differential privacy, for a given privacy budget $\varepsilon$, if for every pair of neighboring data sets $D, D' \in \mathcal{D}$ and all sets $O \subseteq \mathcal{O}$: + $$\Pr[\mathcal{M}(D) \in O] \leq e^\varepsilon \Pr[\mathcal{M}(D') \in O]$$ +\end{definition} + +\noindent $\Pr[\cdot]$ denotes the probability of $\mathcal{M}$ generating $\pmb{o}$ as output, from a set $O \subseteq \mathcal{O}$, when given any version of $D$ as input. +The privacy budget $\varepsilon$ is a positive real number that represents the user-defined privacy goal~\cite{mcsherry2009privacy}. +As the definition implies, $\mathcal{M}$ achieves stronger privacy protection for lower values of $\varepsilon$ since the probabilities of $D$ and $D'$ being true worlds are similar, but the utility of $\pmb{o}$ is reduced since more randomness is introduced by $\mathcal{M}$. +The privacy budget $\varepsilon$ is usually set to $0.01$, $0.1$, or, in some cases, $\ln2$ or $\ln3$~\cite{lee2011much}. + +\begin{definition} + [Query function sensitivity] + \label{def:qry-sens} + The sensitivity of a query function $f$ for all neighboring data sets $D, D' \in \mathcal{D}$ is: + $$\Delta f = \max_{D, D' \in \mathcal{D}} \lVert {f(D) - f(D')} \rVert_{1}$$ +\end{definition} + +The pertinence of differential privacy methods is inseparable from the sensitivity of the query function. +Differential privacy methods are best for low sensitivity queries, such as counts, where the presence/absence of a single record can only change the result slightly. 
+However, sum and max queries can be problematic since a single (very different) value could change the output noticeably, making it necessary to add a lot of noise to the query's answer. +Furthermore, asking a series of queries may allow the disambiguation between possible data sets, making it necessary to add even more noise to the outputs. +For this reason, after a series of queries exhausts the available privacy budget, the data set has to be discarded. +Keeping the original guarantee across multiple queries that require different/new answers requires the injection of noise proportional to the number of executed queries, thus destroying the utility of the output. + +\paragraph{Privacy mechanisms} +\label{subsec:prv-mech} +A typical example of a differential privacy mechanism is the \emph{Laplace mechanism}~\cite{dwork2014algorithmic}. +It randomly draws a value from the probability distribution of $\textrm{Laplace}(\mu, b)$, where $\mu$ stands for the location parameter and $b > 0$ the scale parameter (Figure~\ref{fig:laplace}). Here, $\mu$ is equal to the original output value of a query function, and $b$ is the sensitivity of the query function divided by $\varepsilon$. The Laplace mechanism works for any query function whose range is the set of real numbers. A specialization of this mechanism for location data is the \emph{Planar Laplace mechanism}~\cite{andres2013geo}, which is based on a multivariate Laplace distribution. -For query functions that do not return a real number, e.g.,~`What is the most visited country this year?' or in cases where perturbing the value of the output will completely destroy its utility, e.g.,~`What is the optimal price for this auction?', most works in the literature use the \emph{Exponential mechanism}~\cite{dwork2014algorithmic}. 
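The Laplace mechanism lends itself to a compact illustration. The following sketch (ours, not taken from the cited works; function and parameter names are illustrative) perturbs a count query, whose sensitivity is $1$, with noise of scale $b = \Delta f / \varepsilon$:

```python
import random

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Return true_answer + noise, with noise ~ Laplace(0, sensitivity / epsilon)."""
    b = sensitivity / epsilon  # scale parameter of the Laplace distribution
    # The difference of two i.i.d. Exp(1) draws follows Laplace(0, 1).
    noise = b * (random.expovariate(1.0) - random.expovariate(1.0))
    return true_answer + noise

# A count query has sensitivity 1: adding/removing one individual's
# record changes the true count by at most 1.
noisy_count = laplace_mechanism(true_answer=42, sensitivity=1, epsilon=0.1)
```

Note that a smaller $\varepsilon$ yields a larger scale $b$, and thus noisier but more private answers.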
+ +\begin{figure}[htp] + \centering + \includegraphics[width=.7\linewidth]{laplace} + \caption{A Laplace distribution for location $\mu = 2$ and scale $b = 1$.} + \label{fig:laplace} +\end{figure} + +For query functions that do not return a real number, e.g.,~`What is the most visited country this year?' or in cases where perturbing the value of the output will completely destroy its utility, e.g.,~`What is the optimal price for this auction?', most works in the literature use the \emph{Exponential mechanism}~\cite{mcsherry2007mechanism}. This mechanism utilizes a utility function $u$ that maps (input data set $D$, output value $r$) pairs to utility scores, and selects an output value $r$ with probability proportional to $\exp(\frac{\varepsilon u(D, r)}{2\Delta u})$, where $\Delta u$ is the sensitivity of the utility function. + Another technique for differential privacy mechanisms is the \emph{randomized response}~\cite{warner1965randomized}. It is a privacy-preserving survey method that introduces probabilistic noise to the statistics of a survey by randomly instructing respondents to answer truthfully or `Yes' to a sensitive, binary question. The technique achieves this randomization by including a random event, e.g.,~the flip of a fair coin. The respondents reveal to the interviewers only their answer to the question, and keep the result of the random event secret (i.e.,~whether the coin was tails or heads). Thereafter, the interviewers can calculate the probability distribution of the random event, e.g.,~$\frac{1}{2}$ heads and $\frac{1}{2}$ tails, and thus they can roughly eliminate the false responses and estimate the final result of the survey. -Differential privacy mechanisms satisfy two composability properties: \emph{sequential} and \emph{parallel}~\cite{mcsherry2009privacy, soria2016big}. 
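The randomized response scheme described above admits a short simulation. The sketch below (ours; it assumes the fair-coin variant where tails forces a `Yes') shows how interviewers can invert the randomization to estimate the true proportion of `Yes' answers:

```python
import random

def respond(truth):
    """One respondent: on heads answer truthfully, on tails answer 'Yes'."""
    if random.random() < 0.5:  # heads: truthful answer
        return truth
    return True  # tails: forced 'Yes'

def estimate_yes_rate(answers):
    """Invert the randomization: Pr['Yes'] = p/2 + 1/2, hence p = 2*rate - 1."""
    observed = sum(answers) / len(answers)
    return 2 * observed - 1

# Simulate a population in which 30% would truthfully answer 'Yes'.
random.seed(1)
answers = [respond(random.random() < 0.3) for _ in range(100_000)]
p_hat = estimate_yes_rate(answers)  # close to 0.3
```

Each respondent retains plausible deniability for their individual answer, while the aggregate estimate converges to the true rate as the sample grows.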
-Due to the sequential composability property, the total privacy level of two independent mechanisms $\mathcal{M}_1$ and $\mathcal{M}_2$ over the same data set that satisfy $\varepsilon_1$ and $\varepsilon_2$, respectively, equals to $\varepsilon_1 + \varepsilon_2$. -The parallel composability property dictates that, when the mechanisms $\mathcal{M}_1$ and $\mathcal{M}_2$ are applied over disjoint subsets of the same data set, then the overall privacy level is $\max_{ i\in\{1,2\}}\varepsilon_i $. -Every time a data publisher interacts with (any part of) the original data set, it is mandatory to consume some of the available privacy budget according to the composability properties. -This is a necessity, so as to ensure that there will be no further arbitrary privacy loss, when the released data sets will be acquired by adversaries (or simple users). -However, \emph{post-processing} the output of a differential privacy mechanism can be done without using any additional privacy budget. -Naturally, using the same (or different) privacy mechanism(s) multiple times to interact with raw data in combination with already perturbed data, implies that the privacy guarantee of the final output will be calculated according to sequential composition. - -Differential privacy methods are best for low sensitivity queries such as counts, because the presence/\allowbreak absence of a single record can only change the result slightly. -However, sum and max queries can be problematic, since a single but very different value could change the output noticeably, making it necessary to add a lot of noise to the query's answer. -Furthermore, asking a series of queries may allow the disambiguation between possible data sets, making it necessary to add even more noise to the outputs. -For this reason, after a series of queries exhausts the available privacy budget, the data set has to be discarded. 
-Keeping the original guarantee across multiple queries that require different/\allowbreak new answers, one must inject noise proportional to the number of the executed queries, and thus destroying the utility of the output. - A special category of differential privacy-preserving algorithms is that of \emph{pan-private} algorithms~\cite{dwork2010pan}. Pan-private algorithms hold their privacy guarantees even when snapshots of their internal state (memory) are accessed during their execution by an external entity, e.g.,~subpoena, security breach, etc. There are two intrusion types that a data publisher has to deal with when designing a pan-private mechanism: \emph{single unannounced}, and \emph{continual announced} intrusion. In the first, the data publisher assumes that the mechanism's state is observed by the external entity a single time, without the data publisher ever being notified about it. In the latter, the external entity gains access to the mechanism's state multiple times, and the publisher is notified after each time. The simplest approach to deal with both cases is to make sure that the data in the memory of the mechanism constantly have the same distribution, i.e.,~they are differentially private. -Notice that this must hold throughout the mechanism's lifetime, even before/\allowbreak after it processes any sensitive data point(s). +Notice that this must hold throughout the mechanism's lifetime, even before/\allowbreak after it processes any sensitive data item(s). The notion of differential privacy has highly influenced the research community, resulting in many follow-up publications (\cite{mcsherry2007mechanism, kifer2011no, zhang2017privbayes} to mention a few). We distinguish here \emph{Pufferfish}~\cite{kifer2014pufferfish} and \emph{geo-indistinguishability}~\cite{andres2013geo,chatzikokolakis2015geo}. 
@@ -249,6 +277,249 @@ This similarity depends on $r$ because the closer two locations are, the more li Intuitively, the definition implies that if an adversary learns the published location for an individual, the adversary cannot infer the individual's true location, out of all the points in a radius $r$, with a certainty higher than a factor depending on $l$. The technique adds random noise drawn from a multivariate Laplace distribution to individuals' locations, while taking into account spatial boundaries and features. +\paragraph{Composition} +\label{subsec:compo} + +Mechanisms that satisfy differential privacy are \emph{composable}, i.e.,~the combination of their results satisfies differential privacy as well. +In this section, we provide an overview of the most prominent composition theorems that instruct data publishers \emph{how} to estimate the overall privacy protection when utilizing a series of differential privacy mechanisms. + +\begin{theorem} + [Composition] + \label{theor:compo} + Any combination of a set of independent differential privacy mechanisms satisfying a corresponding set of privacy guarantees satisfies differential privacy as well, i.e.,~provides a differentially private output. +\end{theorem} + +Generally, when we apply a series of independent (i.e.,~in the way that they inject noise) differential privacy mechanisms on independent data, we can calculate the privacy level of the resulting output according to the \emph{sequential} composition property~\cite{mcsherry2009privacy, soria2016big}. + +\begin{theorem} + [Sequential composition on independent data] + \label{theor:compo-seq-ind} + The privacy guarantee of $m \in \mathbb{Z}^+$ independent privacy mechanisms, satisfying $\varepsilon_1$-, $\varepsilon_2$-, \dots, $\varepsilon_m$-differential privacy respectively, when applied over the same data set, equals $\sum_{i = 1}^m \varepsilon_i$. 
+\end{theorem} + +Notice that the sequential composition corresponds to the worst-case scenario where each time we use a mechanism we have to invest some (or all) of the available privacy budget. +In the special case that we query disjoint data sets, we can take advantage of the \emph{parallel} composition property~\cite{mcsherry2009privacy, soria2016big}, and thus spare some of the available privacy budget. + +\begin{theorem} + [Parallel composition on independent data] + \label{theor:compo-par-ind} + When $m \in \mathbb{Z}^+$ independent privacy mechanisms, satisfying $\varepsilon_1$-, $\varepsilon_2$-,\dots, $\varepsilon_m$-differential privacy respectively, are applied over disjoint independent subsets of a data set, they provide a privacy guarantee equal to $\max_{i \in [1, m]} \varepsilon_i$. +\end{theorem} + +When the users consider recent data releases more privacy sensitive than distant ones, we estimate the overall privacy loss in a time fading manner according to a temporal discounting function, e.g.,~exponential or hyperbolic~\cite{farokhi2020temporally}. + +\begin{theorem} + [Sequential composition with temporal discounting] + \label{theor:compo-seq-disc} + A set of $m \in \mathbb{Z}^+$ independent privacy mechanisms, satisfying $\varepsilon_1$-, $\varepsilon_2$-,\dots, $\varepsilon_m$-differential privacy respectively, satisfies $\sum_{i = 1}^m g(i) \varepsilon_i$-differential privacy for a discount function $g$. +\end{theorem} + +% The presence of temporal correlations might result into additional privacy loss consisting of \emph{backward privacy loss} $\alpha^B$ and \emph{forward privacy loss} $\alpha^F$~\cite{cao2017quantifying}. +Cao et al.~\cite{cao2017quantifying} propose a method for computing the temporal privacy loss (TPL) of a differential privacy mechanism in the presence of temporal correlations and background knowledge. 
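The composition theorems above reduce to simple budget arithmetic; the following is a minimal sketch of that accounting (illustrative only; function names are ours):

```python
def sequential(epsilons):
    """Sequential composition over the same data set: the budgets add up."""
    return sum(epsilons)

def parallel(epsilons):
    """Parallel composition over disjoint subsets: the maximum budget dominates."""
    return max(epsilons)

def discounted(epsilons, g):
    """Sequential composition weighted by a temporal discount function g(i)."""
    return sum(g(i) * eps for i, eps in enumerate(epsilons, start=1))

# Three mechanisms with budgets 0.1, 0.2, and 0.3:
eps = [0.1, 0.2, 0.3]
total_same_data = sequential(eps)   # budgets accumulate on the same data set
total_disjoint = parallel(eps)      # only the largest budget counts
total_discounted = discounted(eps, lambda i: 2.0 ** -i)  # discounted total
```

The same arithmetic underlies privacy budget accounting in continuous publishing: each new release on the same data consumes budget sequentially, while releases over disjoint partitions do not add up.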
+The goal of their technique is to guarantee privacy protection and to bound the privacy loss at every timestamp under the assumption of independent data releases. +It calculates the temporal privacy loss as the sum of the backward and forward privacy loss minus the default privacy loss $\varepsilon$ of the mechanism (because it is counted twice in the aforementioned entities). +This calculation is done for each individual that is included in the original data set and the overall temporal privacy loss is equal to the maximum calculated value at every timestamp. +The backward/forward privacy loss at any timestamp depends on the backward/forward privacy loss at the previous/next timestamp, the backward/forward temporal correlations, and $\varepsilon$. + +\begin{definition} + [Temporal privacy loss (TPL)] + \label{def:tpl} + The potential privacy loss of a privacy mechanism at a timestamp $t \leq T$ due to a series of outputs $\pmb{o}_1$, \dots, $\pmb{o}_T$ and temporal correlations in its input $D_t$ with respect to any adversary, targeting an individual with potential data items $x_t$ (or $x'_t$) and having knowledge $\mathbb{D}_t$ equal to $D_t - \{x_t\}$ (or $D'_t - \{x'_t\}$), is defined as: + + \begin{equation} + \label{eq:tpl} + \alpha_t = \sup_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_T} \ln \frac{\Pr[\pmb{o}_1, \dots, \pmb{o}_T | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_1, \dots, \pmb{o}_T | x'_t, \mathbb{D}_t]} + \end{equation} +\end{definition} +% +By analyzing Equation~\ref{eq:tpl} we get the following: + +\begin{align} + \label{eq:tpl-1} + (\ref{eq:tpl}) = & \underbrace{\sup_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_1, \dots, \pmb{o}_t| x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_1, \dots, \pmb{o}_t | x'_t, \mathbb{D}_t]}}_{\text{Backward privacy loss}\ (\alpha^B_t)} \nonumber \\ + & + \underbrace{\sup_{x_t, x'_t, \pmb{o}_t, \dots, \pmb{o}_T} \ln \frac{\Pr[\pmb{o}_t, \dots, \pmb{o}_T| x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t, \dots, \pmb{o}_T | x'_t, 
\mathbb{D}_t]}}_{\text{Forward privacy loss}\ (\alpha^F_t)} \nonumber \\ + & - \underbrace{\sup_{x_t, x'_t, \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t] }{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}}_{\text{Present privacy loss}\ (\varepsilon_t)} +\end{align} + +\begin{definition} + [Backward privacy loss (BPL)] + \label{def:bpl} + The potential privacy loss of a privacy mechanism at a timestamp $t \leq T$ due to outputs $\pmb{o}_1$, \dots, $\pmb{o}_t$ and temporal correlations in its input $D_t$ with respect to any adversary, targeting an individual with potential data items $x_t$ (or $x'_t$) and having knowledge $\mathbb{D}_t$ equal to $D_t - \{x_t\}$ (or $D'_t - \{x'_t\}$), is called backward privacy loss and is defined as: + + \begin{equation} + \label{eq:bpl-1} + \alpha^B_t = \sup_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_1, \dots, \pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_1, \dots, \pmb{o}_t | x'_t, \mathbb{D}_t]} + \end{equation} + +\end{definition} +% +From differential privacy we have the assumption that $\pmb{o}_1$, \dots, $\pmb{o}_t$ are independent events. 
+Therefore, according to the Bayesian theorem, we can write Equation~\ref{eq:bpl-1} as: + +\begin{align} + \label{eq:bpl-2} + (\ref{eq:bpl-1}) = & \sup_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1}| x_t, \mathbb{D}_t] \Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x'_t, \mathbb{D}_t] \Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]} \nonumber \\ + = & \sup_{x_t, x_t', \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \frac{\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_t', \mathbb{D}_t]} \nonumber \\ + & + \sup_{x_t, x_t', \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]} +\end{align} +% +Applying the law of total probability to the first term of Equation~\ref{eq:bpl-2} for all the possible data $x_{t - 1}$ (or $x'_{t - 1}$) and $\mathbb{D}_{t - 1}$ we get the following: + +\begin{align} + \label{eq:bpl-3} + (\ref{eq:bpl-2}) = & + \adjustbox{max width=0.9\linewidth}{ + $\sup\limits_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \frac{\sum\limits_{x_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_t, \mathbb{D}_t, x_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x_{t - 1}, \mathbb{D}_{t - 1} | x_t, \mathbb{D}_t]}{\sum\limits_{x'_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x'_t, \mathbb{D}_t, x'_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x'_{t - 1}, \mathbb{D}_{t - 1} | x'_t, \mathbb{D}_t]}$ + } \nonumber \\ + & \adjustbox{max width=0.3\linewidth}{ + $+ \sup\limits_{x_t, x_t', \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}$ + } +\end{align} +% +Since $\mathbb{D}_t$ is equal to $D_t - \{x_t\}$ (or $D'_t - \{x'_t\}$), and thus is constant and independent of every possible $x_t$ (or $x'_t$), $\forall t \leq T$, Equation~\ref{eq:bpl-3} can be written as: + +\begin{align} + \label{eq:bpl-4} + (\ref{eq:bpl-3}) = & + \adjustbox{max width=0.9\linewidth}{ + $\sup\limits_{x_t, x'_t, \pmb{o}_1, 
\dots, \pmb{o}_{t - 1}} \ln \frac{\sum\limits_{x_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_t, \mathbb{D}_t, x_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x_{t - 1} | x_t, \mathbb{D}_t] \Pr[\mathbb{D}_{t - 1} | x_t, \mathbb{D}_t]}{\sum\limits_{x'_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x'_t, \mathbb{D}_t, x'_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x'_{t - 1} | x'_t, \mathbb{D}_t] \Pr[\mathbb{D}_{t - 1} | x'_t, \mathbb{D}_t]}$ + } \nonumber \\ + & \adjustbox{max width=0.275\linewidth}{ + $+ \sup\limits_{x_t, x_t', \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}$ + } \nonumber \\ + = & \adjustbox{max width=0.825\linewidth}{ + $\sup\limits_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \frac{\sum\limits_{x_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_t, \mathbb{D}_t, x_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x_{t - 1} | x_t] \Pr[\mathbb{D}_{t - 1} | \mathbb{D}_t]}{\sum\limits_{x'_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x'_t, \mathbb{D}_t, x'_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x'_{t - 1} | x'_t] \Pr[\mathbb{D}_{t - 1} | \mathbb{D}_t]}$ + } \nonumber \\ + & \adjustbox{max width=0.275\linewidth}{ + $+ \sup\limits_{x_t, x_t', \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}$ + } \nonumber \\ + = & \adjustbox{max width=0.7\linewidth}{ + $\sup\limits_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \frac{\sum\limits_{x_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_t, \mathbb{D}_t, x_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x_{t - 1} | x_t]}{\sum\limits_{x'_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x'_t, \mathbb{D}_t, x'_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x'_{t - 1} | x'_t]}$ + } \nonumber \\ + & \adjustbox{max width=0.275\linewidth}{ + $+ \sup\limits_{x_t, x_t', \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}$ + } +\end{align} +% +The outputs $\pmb{o}_1$, \dots, $\pmb{o}_t$ and $x_t$ (or $x'_t$) are conditionally independent 
in the presence of
+$x_{t - 1}$ (or $x'_{t - 1}$), and thus Equation~\ref{eq:bpl-4} can be written as:
+
+\begin{align}
+	\label{eq:bpl-5}
+	(\ref{eq:bpl-4}) = &
+	\adjustbox{max width=0.9\linewidth}{
+		$\sup\limits_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \cfrac{\sum\limits_{x_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x_{t - 1} | x_t]}{\sum\limits_{x'_{t - 1}} \underbrace{\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x'_{t - 1}, \mathbb{D}_{t - 1}]}_{\alpha^B_{t - 1}} \underbrace{\Pr[x'_{t - 1} | x'_t]}_{P^B_{t - 1}}}$
+	} \nonumber \\
+	& \adjustbox{max width=0.4\linewidth}{
+		$+ \underbrace{\sup\limits_{x_t, x'_t, \pmb{o}_t} \ln \cfrac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}}_{\varepsilon_t}$
+	}
+\end{align}
+
+\begin{definition}
+	[Forward privacy loss (FPL)]
+	\label{def:fpl}
+	The potential privacy loss of a privacy mechanism at a timestamp $t \leq T$, due to the outputs $\pmb{o}_t$, \dots, $\pmb{o}_T$ and the temporal correlations in its input $D_t$, with respect to any adversary targeting an individual with potential data item $x_t$ (or $x'_t$) and having knowledge $\mathbb{D}_t$ equal to $D_t - \{x_t\}$ (or $D'_t - \{x'_t\}$), is called forward privacy loss and is defined as:
+
+	\begin{equation}
+		\label{eq:fpl-1}
+		\alpha^F_t = \sup_{x_t, x'_t, \pmb{o}_t, \dots, \pmb{o}_T} \ln \frac{\Pr[\pmb{o}_t, \dots, \pmb{o}_T | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t, \dots, \pmb{o}_T | x'_t, \mathbb{D}_t]}
+	\end{equation}
+\end{definition}
+%
+Similarly to the way that we obtained Equation~\ref{eq:bpl-5} from Equation~\ref{eq:bpl-1}, we can write Equation~\ref{eq:fpl-1} as follows:
+
+\begin{align}
+	\label{eq:fpl-2}
+	(\ref{eq:fpl-1}) = &
+	\adjustbox{max width=0.9\linewidth}{
+		$\sup\limits_{x_t, x'_t, \pmb{o}_{t + 1}, \dots, \pmb{o}_T} \ln \cfrac{\sum\limits_{x_{t + 1}} \Pr[\pmb{o}_{t + 1}, \dots, \pmb{o}_T | x_{t + 1}, \mathbb{D}_{t + 1}] \Pr[x_{t + 1} | x_t]}{\sum\limits_{x'_{t + 1}}
\underbrace{\Pr[\pmb{o}_{t + 1}, \dots, \pmb{o}_T | x'_{t + 1}, \mathbb{D}_{t + 1}]}_{\alpha^F_{t + 1}} \underbrace{\Pr[x'_{t + 1} | x'_t]}_{P^F_{t + 1}}}$
+	} \nonumber \\
+	& \adjustbox{max width=0.4\linewidth}{
+		$+ \underbrace{\sup\limits_{x_t, x'_t, \pmb{o}_t} \ln \cfrac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}}_{\varepsilon_t}$
+	}
+\end{align}
+
+Equations~\ref{eq:tpl-1},~\ref{eq:bpl-5}, and~\ref{eq:fpl-2} apply to the global publishing schema.
+In the local schema, $D$ (or $D'$) is a single data item and is the same as $x$ (or $x'$), i.e.,~the possible data item of an individual user.
+Therefore, we calculate the extra privacy loss under temporal correlations, due to an adversary that targets a user at a timestamp $t$, based on the assumption that their possible data are $D_t$ or $D'_t$.
+More specifically, the calculation of TPL (Equation~\ref{eq:tpl-1}) becomes:
+\begin{align}
+	\label{eq:tpl-local}
+	& \underbrace{\sup_{D_t, D'_t, \pmb{o}_1, \dots, \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_1, \dots, \pmb{o}_t | D_t]}{\Pr[\pmb{o}_1, \dots, \pmb{o}_t | D'_t]}}_{\text{Backward privacy loss}\ (\alpha^B_t)} \nonumber \\
+	& + \underbrace{\sup_{D_t, D'_t, \pmb{o}_t, \dots, \pmb{o}_T} \ln \frac{\Pr[\pmb{o}_t, \dots, \pmb{o}_T | D_t]}{\Pr[\pmb{o}_t, \dots, \pmb{o}_T | D'_t]}}_{\text{Forward privacy loss}\ (\alpha^F_t)} \nonumber \\
+	& - \underbrace{\sup_{D_t, D'_t, \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | D_t]}{\Pr[\pmb{o}_t | D'_t]}}_{\text{Present privacy loss}\ (\varepsilon_t)}
+\end{align}
+%
+The calculation of BPL (Equation~\ref{eq:bpl-5}) becomes:
+\begin{align}
+	\label{eq:bpl-local}
+	& \adjustbox{max width=0.9\linewidth}{
+		$\sup\limits_{D_t, D'_t, \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \cfrac{\sum\limits_{D_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | D_{t - 1}] \Pr[D_{t - 1} | D_t]}{\sum\limits_{D'_{t - 1}} \underbrace{\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | D'_{t - 1}]}_{\alpha^B_{t - 1}} \underbrace{\Pr[D'_{t - 1} |
D'_t]}_{P^B_{t - 1}}}$
+	} \nonumber \\
+	& \adjustbox{max width=0.4\linewidth}{
+		$+ \underbrace{\sup\limits_{D_t, D'_t, \pmb{o}_t} \ln \cfrac{\Pr[\pmb{o}_t | D_t]}{\Pr[\pmb{o}_t | D'_t]}}_{\varepsilon_t}$
+	}
+\end{align}
+%
+The calculation of FPL (Equation~\ref{eq:fpl-2}) becomes:
+\begin{align}
+	\label{eq:fpl-local}
+	& \adjustbox{max width=0.9\linewidth}{
+		$\sup\limits_{D_t, D'_t, \pmb{o}_{t + 1}, \dots, \pmb{o}_T} \ln \cfrac{\sum\limits_{D_{t + 1}} \Pr[\pmb{o}_{t + 1}, \dots, \pmb{o}_T | D_{t + 1}] \Pr[D_{t + 1} | D_t]}{\sum\limits_{D'_{t + 1}} \underbrace{\Pr[\pmb{o}_{t + 1}, \dots, \pmb{o}_T | D'_{t + 1}]}_{\alpha^F_{t + 1}} \underbrace{\Pr[D'_{t + 1} | D'_t]}_{P^F_{t + 1}}}$
+	} \nonumber \\
+	& \adjustbox{max width=0.4\linewidth}{
+		$+ \underbrace{\sup\limits_{D_t, D'_t, \pmb{o}_t} \ln \cfrac{\Pr[\pmb{o}_t | D_t]}{\Pr[\pmb{o}_t | D'_t]}}_{\varepsilon_t}$
+	}
+\end{align}
+
+The authors propose solutions to bound the temporal privacy loss, in the presence of weak to moderate correlations, in both finite and infinite data publishing scenarios.
+In the infinite scenario, they try to find a value of $\varepsilon$ for which the backward and forward privacy loss are equal.
+In the finite scenario, they similarly try to balance the backward and forward privacy loss, while allocating more $\varepsilon$ to the first and last timestamps, since these have a higher impact on the privacy loss of the subsequent and preceding ones.
+This way, they achieve an overall constant temporal privacy loss throughout the time series.
+
+According to the technique's intuition, stronger correlations result in higher privacy loss.
+However, the loss is smaller when the dimension of the transition matrix, which is extracted according to the modeling of the correlations (in this work, Markov chains), is greater, since larger transition matrices tend to be more uniform, resulting in weaker data dependence.
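To make the recursive structure of the backward privacy loss concrete, the following sketch computes one step of the recursion in Equation~\ref{eq:bpl-local} for a binary-state case. It is our own simplified illustration, not the authors' implementation: it assumes two possible values for the user's data, takes the backward transition probabilities $\Pr[D_{t - 1} | D_t]$ as a $2 \times 2$ matrix, and uses the fact that the objective, a ratio of two linear functions of the previous outputs' probabilities (whose ratio is bounded by $e^{\pm \alpha^B_{t - 1}}$), is maximized at an endpoint of the feasible interval.

```python
import itertools
import math

def backward_privacy_loss_step(P, eps_t, alpha_prev):
    # P[i][j] = Pr[D_{t-1} = j | D_t = i]: backward transition probabilities.
    # The unknown probabilities A(j) = Pr[o_1, ..., o_{t-1} | D_{t-1} = j] are
    # constrained by the previous backward loss: A(0) / A(1) lies in
    # [exp(-alpha_prev), exp(alpha_prev)].  Writing A = (a, 1) (the log-ratio
    # is scale-invariant), the objective is a ratio of linear functions of a,
    # hence maximized at an endpoint of that interval.
    best = 0.0
    for i, j in itertools.permutations(range(2), 2):
        for a in (math.exp(-alpha_prev), math.exp(alpha_prev)):
            num = P[i][0] * a + P[i][1]
            den = P[j][0] * a + P[j][1]
            best = max(best, math.log(num / den))
    return eps_t + best
```

Consistently with the intuition above, a uniform transition matrix yields no extra loss (the result collapses to $\varepsilon_t$), whereas a deterministic one degenerates to plain sequential composition ($\varepsilon_t + \alpha^B_{t - 1}$).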
+The authors briefly investigate all of the possible privacy levels; however, the solutions that they propose apply only at event-level.
+Last but not least, the technique requires the calculation of the temporal privacy loss for every individual within the data set, which might prove computationally inefficient in real-time scenarios.
+
+When dealing with temporally correlated data, we handle a sequence of $w \leq t \in \mathbb{Z}^+$ mechanisms (indexed by $m \in [1, t]$) as a single entity, where each mechanism contributes to the temporal privacy loss depending on its order of application~\cite{cao2017quantifying}.
+The first ($m - 1$ if $w \leq 2$, or $m - w + 1$ if $w > 2$) and last ($m$) mechanisms contribute to the backward and forward temporal privacy loss, respectively.
+When $w$ is greater than $2$, the rest of the mechanisms (between $m - w + 2$ and $m - 1$) contribute only the privacy loss that corresponds to the publication of the relevant data.
+
+\begin{theorem}
+	[Sequential composition under temporal correlations]
+	\label{theor:compo-seq-cor}
+	When a set of $w \leq t \in \mathbb{Z}^+$ independent privacy mechanisms, satisfying $\varepsilon_{m \in [1, t]}$-differential privacy, is applied over a sequence of an equal number of temporally correlated data sets, it provides a privacy guarantee equal to:
+	$$
+	\begin{cases}
+		\alpha^B_{m - 1} + \alpha^F_m & \quad w \leq 2 \\
+		\alpha^B_{m - w + 1} + \alpha^F_m + \sum_{i = m - w + 2}^{m - 1} \varepsilon_i & \quad w > 2
+	\end{cases}
+	$$
+\end{theorem}
+
+Notice that the estimation of the forward privacy loss is only pertinent to a setting with finite observation and moderate correlations.
+In different circumstances, it might be impossible to calculate the upper bound of the temporal privacy loss, and thus only the backward privacy loss would be relevant.
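As a numerical companion to Theorem~\ref{theor:compo-seq-cor}, the following minimal sketch (the function name and 1-based indexing convention are ours) evaluates the composed guarantee for a window of $w$ mechanisms ending at position $m$, assuming the per-timestamp losses $\alpha^B$, $\alpha^F$, and $\varepsilon$ have already been computed:

```python
def temporal_guarantee(alpha_B, alpha_F, eps, m, w):
    # alpha_B[i], alpha_F[i], eps[i]: backward loss, forward loss, and budget
    # at the 1-based timestamp i (index 0 is unused padding, so that the code
    # matches the theorem's notation).
    if w <= 2:
        return alpha_B[m - 1] + alpha_F[m]
    # Mechanisms strictly inside the window contribute only their eps_i.
    return alpha_B[m - w + 1] + alpha_F[m] + sum(eps[m - w + 2:m])
```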
+ +% Notice that---although we refer to it as `sequential'---since Theorem~\ref{theor:compo-seq-cor} refers to the application of a sequence of mechanisms to a respective sequence of disjoint data sets, we would normally expect it to correspond to the parallel composition on independent data (Theorem~\ref{theor:compo-par-ind}). +% However, due to the temporal correlations, the data sets are considered as one single data set; therefore, the application of a sequence of mechanisms can be handled according to the sequential composition on independent data (Theorem~\ref{theor:compo-seq-ind}). + + +\paragraph{Post-processing} +\label{subsec:p-proc} + +Every time a data publisher interacts with (any part of) the original data set, it is mandatory to consume some of the available privacy budget according to the composition theorems~\ref{theor:compo-seq-ind} and~\ref{theor:compo-par-ind}. +However, the \emph{post-processing} of a perturbed data set can be done without using any additional privacy budget. + +\begin{theorem} + [Post-processing] + \label{theor:post-processing} + The post-processing of any output of an $\varepsilon$-differential privacy mechanism shall not deteriorate its privacy guarantee. +\end{theorem} + +Naturally, using the same (or different) privacy mechanism(s) multiple times to interact with raw data in combination with already perturbed data implies that the privacy guarantee of the final output will be calculated according to Theorem~\ref{theor:compo-seq-ind}. + + \begin{example} \label{ex:application} To illustrate the usage of the microdata and statistical data techniques for privacy-preserving data publishing, we revisit Example~\ref{ex:continuous}. @@ -256,42 +527,7 @@ The technique adds random noise drawn from a multivariate Laplace distribution t Then, the reported data are collected by the central service, in order to be protected and then published, either as a whole, or as statistics thereof. 
Notice that in order to showcase the straightforward application of $k$-anonymity and differential privacy, we apply the two methods to each timestamp independently of the previous ones, and do not take into account any additional threats imposed by continuity.
-  \begin{table}
-    \centering\noindent\adjustbox{max width=\linewidth} {
-      \begin{tabular}{@{}ccc@{}}
-        \begin{tabular}{@{}lrll@{}}
-          \toprule
-          \textit{Name} & \multicolumn{1}{c}{Age} & Location & Status \\
-          \midrule
-          * & $> 20$ & Paris & at work \\
-          * & $> 20$ & Paris & driving \\
-          * & $> 20$ & Paris & dining \\
-          \midrule
-          * & $\leq 20$ & Paris & running \\
-          * & $\leq 20$ & Paris & at home \\
-          * & $\leq 20$ & Paris & walking \\
-          \bottomrule
-        \end{tabular} &
-        \begin{tabular}{@{}lrll@{}}
-          \toprule
-          \textit{Name} & \multicolumn{1}{c}{Age} & Location & Status \\
-          \midrule
-          * & $> 20$ & Paris & driving \\
-          * & $> 20$ & Paris & at the mall \\
-          * & $> 20$ & Paris & biking \\
-          \midrule
-          * & $\leq 20$ & Paris & sightseeing \\
-          * & $\leq 20$ & Paris & walking \\
-          * & $\leq 20$ & Paris & at home \\
-          \bottomrule
-        \end{tabular} &
-        \dots \\
-        $t_1$ & $ t_2$ & \\
-      \end{tabular}%
-    }%
-    \caption{3-anonymous event-level protected versions of the microdata in Table~\ref{tab:continuous-micro}.}
-    \label{tab:scenario-micro}
-  \end{table}
+  \includetable{scenario-micro}

  First, we anonymize the data set of Table~\ref{tab:continuous-micro} using $k$-anonymity, with $k = 3$.
  This means that any user cannot be distinguished from at least $2$ other users.
@@ -302,44 +538,7 @@ The technique adds random noise drawn from a multivariate Laplace distribution t
  Finally, we achieve $3$-anonymity by putting the entries in groups of three, according to the quasi-identifiers.
  Table~\ref{tab:scenario-micro} depicts the results at each timestamp.
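The anonymization steps just described (suppressing the identifying attribute and generalizing Age into the two brackets of the published table) can be sketched as follows; the helper name and record layout are ours:

```python
from collections import Counter

def k_anonymize(records, k=3):
    # Suppress the identifying attribute and generalize Age into two brackets,
    # then check that every quasi-identifier group holds at least k records.
    anonymized = [{"Name": "*",
                   "Age": "> 20" if r["Age"] > 20 else "<= 20",
                   "Location": r["Location"],
                   "Status": r["Status"]}
                  for r in records]
    groups = Counter((r["Age"], r["Location"]) for r in anonymized)
    assert all(size >= k for size in groups.values()), "grouping is not k-anonymous"
    return anonymized
```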
- \begin{table} - \centering - \subcaptionbox{True counts\label{tab:statistical-true}}{% - \begin{tabular}{@{}lr@{}} - \toprule - Location & \multicolumn{1}{c@{}}{Count} \\ - \midrule - Belleville & $1$ \\ - Latin Quarter & $1$ \\ - Le Marais & $1$ \\ - Montmartre & $2$ \\ - Opera & $1$ \\ - \bottomrule - \end{tabular}% - }\quad - \subcaptionbox*{}{% - \begin{tabular}{@{}c@{}} - \\ \\ \\ - $\xrightarrow[]{\text{Noise}}$ - \\ \\ \\ - \end{tabular}% - }\quad - \subcaptionbox{Perturbed counts\label{tab:statistical-noisy}}{% - \begin{tabular}{@{}lr@{}} - \toprule - Location & \multicolumn{1}{c@{}}{Count} \\ - \midrule - Belleville & $1$ \\ - Latin Quarter & $0$ \\ - Le Marais & $2$ \\ - Montmartre & $3$ \\ - Opera & $1$ \\ - \bottomrule - \end{tabular}% - }% - \caption{(a)~The original version of the data of Table~\ref{tab:continuous-statistical}, and (b)~their $1$-differentially event-level private version.} - \label{tab:scenario-statistical} - \end{table} + \includetable{scenario-statistical} Next, we demonstrate differential privacy. We apply an $\varepsilon$-differentially private Laplace mechanism, with $\varepsilon = 1$, taking into account the count query that generated the true counts of Table~\ref{tab:continuous-statistical}.
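A minimal sketch of this step, under our own naming: a count query has sensitivity $1$, so the Laplace noise scale is $1 / \varepsilon$; the final rounding and clamping of the noisy counts is post-processing and, by Theorem~\ref{theor:post-processing}, consumes no additional privacy budget.

```python
import math
import random

def laplace_mechanism(counts, eps, rng):
    # A count query has sensitivity 1, so the noise scale is 1 / eps.
    scale = 1.0 / eps
    noisy = {}
    for key, value in counts.items():
        u = rng.random() - 0.5  # inverse-CDF sample of Laplace(0, scale)
        noisy[key] = value - scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return noisy

def post_process(noisy):
    # Rounding and clamping are post-processing: no extra privacy budget.
    return {key: max(0, round(value)) for key, value in noisy.items()}

true_counts = {"Belleville": 1, "Latin Quarter": 1, "Le Marais": 1,
               "Montmartre": 2, "Opera": 1}
released = post_process(laplace_mechanism(true_counts, eps=1.0, rng=random.Random(0)))
```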