Merge branch 'master' of https://git.delkappa.com/manos/the-last-thing
This commit is contained in:
commit 84d33dd7f3

rslt/bgt_cmp/Geolife.pdf   BIN (new file; binary file not shown)
rslt/bgt_cmp/T-drive.pdf   BIN (new file; binary file not shown)

tables/continuous.tex (new file, 52 lines)
@@ -0,0 +1,52 @@
\begin{table}
  \centering
  \subcaptionbox{Microdata\label{tab:continuous-micro}}{%
    \adjustbox{max width=\linewidth}{%
      \begin{tabular}{@{}ccc@{}}
        \begin{tabular}{@{}lrll@{}}
          \toprule
          \textit{Name} & \multicolumn{1}{c}{Age} & Location & Status \\
          \midrule
          Donald    & $27$ & Le Marais     & at work \\
          Daisy     & $25$ & Belleville    & driving \\
          Huey      & $12$ & Montmartre    & running \\
          Dewey     & $11$ & Montmartre    & at home \\
          Louie     & $10$ & Latin Quarter & walking \\
          Quackmore & $62$ & Opera         & dining  \\
          \bottomrule
          \multicolumn{4}{c}{$t_1$} \\
        \end{tabular} &
        \begin{tabular}{@{}lrll@{}}
          \toprule
          \textit{Name} & \multicolumn{1}{c}{Age} & Location & Status \\
          \midrule
          Donald    & $27$ & Montmartre    & driving     \\
          Daisy     & $25$ & Montmartre    & at the mall \\
          Huey      & $12$ & Latin Quarter & sightseeing \\
          Dewey     & $11$ & Opera         & walking     \\
          Louie     & $10$ & Latin Quarter & at home     \\
          Quackmore & $62$ & Montmartre    & biking      \\
          \bottomrule
          \multicolumn{4}{c}{$t_2$} \\
        \end{tabular} &
        \dots
      \end{tabular}%
    }%
  } \\ \bigskip
  \subcaptionbox{Statistical data\label{tab:continuous-statistical}}{%
    \begin{tabular}{@{}lrrr@{}}
      \toprule
      \multirow{2}{*}{Location} & \multicolumn{3}{c@{}}{Count} \\
      & \multicolumn{1}{c}{$t_1$} & \multicolumn{1}{c}{$t_2$} & \dots \\
      \midrule
      Belleville    & $1$ & $0$ & \dots \\
      Latin Quarter & $1$ & $2$ & \dots \\
      Le Marais     & $1$ & $0$ & \dots \\
      Montmartre    & $2$ & $3$ & \dots \\
      Opera         & $1$ & $1$ & \dots \\
      \bottomrule
    \end{tabular}%
  }%
  \caption{Continuous data observation of (a)~microdata, and corresponding (b)~statistics at multiple timestamps.}
  \label{tab:continuous}
\end{table}
@@ -68,5 +68,18 @@ Typically, in such cases, we have a collection of data referring to the same ind
Additionally, in many cases, the privacy-preserving processes should take into account implicit correlations and restrictions that exist, e.g.,~space-imposed collocation or movement restrictions.
Since these data are related to most of the important applications and services that enjoy high utilization rates, privacy-preserving continuous data publishing becomes one of the emblematic problems of our time.
+
+To accompany and facilitate the descriptions in this chapter, we provide the following running example.
+
+\begin{example}
+\label{ex:snapshot}
+Users interact with an LBS by making queries, either to retrieve some useful location-based information or simply to report their state at various locations.
+This user--LBS interaction generates user-related data, organized in a schema with the following attributes: \emph{Name} (the unique identifier of the table), \emph{Age}, \emph{Location}, and \emph{Status} (Table~\ref{tab:snapshot-micro}).
+The `Status' attribute includes information that characterizes the user's state or the query itself, and its value varies according to the service functionality.
+Subsequently, the generated data are aggregated (by issuing count queries over them) in order to derive useful information about the popularity of the venues during the day (Table~\ref{tab:snapshot-statistical}).
+
+\includetable{snapshot}
+
+\end{example}
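To make the aggregation step of the running example concrete, here is a minimal Python sketch; it derives the per-venue counts from the $t_1$ microdata of Table~\ref{tab:continuous-micro} shown above (the record list is copied from that table, everything else is illustrative).

    from collections import Counter

    # Microdata records at timestamp t1: (Name, Age, Location, Status),
    # copied from the t1 microdata table above.
    microdata_t1 = [
        ("Donald", 27, "Le Marais", "at work"),
        ("Daisy", 25, "Belleville", "driving"),
        ("Huey", 12, "Montmartre", "running"),
        ("Dewey", 11, "Montmartre", "at home"),
        ("Louie", 10, "Latin Quarter", "walking"),
        ("Quackmore", 62, "Opera", "dining"),
    ]

    # A count query grouped by Location yields the statistical data
    # (venue popularity) of the corresponding statistics table.
    counts_t1 = Counter(location for _, _, location, _ in microdata_t1)
    print(counts_t1)  # e.g., Montmartre: 2, Le Marais: 1, Belleville: 1, ...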
\input{introduction/contribution}
\input{introduction/structure}
text/preliminaries/correlation.tex (new file, 230 lines)
@@ -0,0 +1,230 @@
\section{Data dependence and correlation}
\label{sec:correlation}


\subsection{Types of correlation}

The most prominent types of correlation are the following:

\begin{itemize}
  \item \emph{Temporal}~\cite{wei2006time}---appearing in observations (i.e.,~values) of the same object over time.
  \item \emph{Spatial}~\cite{legendre1993spatial, anselin1995local}---denoted by the degree of similarity of nearby data points in space, and indicating if and how phenomena relate to the (broader) area where they take place.
  \item \emph{Spatiotemporal}---a combination of the previous categories, appearing when processing time series or sequences of human activities with geolocation characteristics, e.g.,~\cite{ghinita2009preventing}.
\end{itemize}

Contrary to one-dimensional correlations, spatial correlation is multi-dimensional and multi-directional, and can be measured by indicators (e.g.,~\emph{Moran's I}~\cite{moran1950notes}) that reflect the \emph{spatial association} of the concerned data.
Spatial autocorrelation has its foundations in the \emph{First Law of Geography}, stating that ``everything is related to everything else, but near things are more related than distant things''~\cite{tobler1970computer}.
A positive spatial autocorrelation indicates that similar data are \emph{clustered}, a negative one that data are dispersed and close to dissimilar ones, and a value close to zero that data are \emph{randomly arranged} in space.

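To make the spatial association measure concrete, the following is a minimal Python sketch of Moran's I for a handful of observations; the toy values and the binary contiguity weights are illustrative assumptions, not taken from the cited works.

    import numpy as np

    def morans_i(x, w):
        """Moran's I for observations x (n,) and spatial weights w (n, n)."""
        x = np.asarray(x, dtype=float)
        w = np.asarray(w, dtype=float)
        n = len(x)
        z = x - x.mean()                      # deviations from the mean
        num = n * np.sum(w * np.outer(z, z))  # weighted cross-products of deviations
        den = w.sum() * np.sum(z ** 2)
        return num / den

    # Toy example: 4 locations on a line, binary contiguity weights.
    x = [1.0, 2.0, 8.0, 9.0]                  # observed values per location
    w = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]])
    print(morans_i(x, w))  # 0.4 > 0: similar values are clustered in space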
\subsection{Extraction of correlation}

A common practice for extracting data dependencies from continuous data is to express the data as a \emph{stochastic} or \emph{random process}.
A random process is a collection of \emph{random variables} or \emph{bivariate data}, indexed by some set, e.g.,~a series of timestamps, a Cartesian plane $\mathbb{R}^2$, an $n$-dimensional Euclidean space, etc.~\cite{skorokhod2005basic}.
The values a random variable can take are outcomes of an unpredictable process, while bivariate data are pairs of data values with a possible association between them.
Expressing data as stochastic processes allows us to model them according to their properties, and thereafter to discover the relevant data dependencies.

Some common stochastic process modeling techniques include the following (a minimal sketch of the Markov chain case follows the list):

\begin{itemize}
  \item \emph{Conditional probabilities}~\cite{allan2013probability}---probabilities of events in the presence of other events.
  \item \emph{Conditional Random Fields} (CRFs)~\cite{lafferty2001conditional}---undirected graphs encoding conditional probability distributions.
  \item \emph{Markov processes}~\cite{rogers2000diffusions}---stochastic processes for which the conditional probability of future states depends only on the present state and is independent of previous states (\emph{Markov assumption}).
  \begin{itemize}
    \item \emph{Markov chains}~\cite{gagniuc2017markov}---sequences of possible events whose probability depends on the state attained in the previous event.
    \item \emph{Hidden Markov Models} (HMMs)~\cite{baum1966statistical}---statistical Markov models of Markov processes with unobserved states.
  \end{itemize}
\end{itemize}
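The Python sketch below estimates a first-order Markov chain, i.e.,~a transition matrix, from a sequence of visited locations; the sequence and the location names are hypothetical and only serve as an illustration.

    import numpy as np

    # A hypothetical sequence of visited locations (one observation per timestamp).
    seq = ["Montmartre", "Montmartre", "Opera", "Latin Quarter", "Opera", "Montmartre"]

    states = sorted(set(seq))
    idx = {s: i for i, s in enumerate(states)}

    # Count first-order transitions and normalize each row into probabilities.
    counts = np.zeros((len(states), len(states)))
    for a, b in zip(seq, seq[1:]):
        counts[idx[a], idx[b]] += 1
    transition = counts / counts.sum(axis=1, keepdims=True)

    print(states)
    print(transition)  # transition[i, j] = Pr[next location = j | current location = i]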

\subsection{Privacy risks of correlation}

Data dependence might appear:

\begin{itemize}
  \item within one data set, and
  \item among one data set and previous data releases, and/or other external sources~\cite{kifer2011no, chen2014correlated, liu2016dependence, zhao2017dependent}.
\end{itemize}

In the former case, data tuples and data values within a data set may be correlated, or linked in such a way that information about one person can be inferred even if the person is absent from the database.
Consequently, in this category we place assumptions made on the data generation model based on randomness, like the random world model, the independent and identically distributed (i.i.d.) data model, or the independent-tuples model, which may be unrealistic for many real-world scenarios.
This attack is also known as the \emph{de Finetti attack}~\cite{kifer2009attacks}.

In the latter case, the strength of the dependence between a pair of variables can be quantified with the utilization of \emph{correlations}~\cite{stigler1989francis}.
Correlation implies dependence but not vice versa; however, the two terms are often used as synonyms.
The correlation among nearby observations, i.e.,~the elements in a series of data points, is referred to as \emph{autocorrelation} or \emph{serial correlation}~\cite{park2018fundamentals}.
Depending on the evaluation technique, e.g.,~\emph{Pearson's correlation coefficient}~\cite{stigler1989francis}, a correlation can be characterized as \emph{negative}, \emph{zero}, or \emph{positive}.
A negative value shows that the behavior of one variable is the \emph{opposite} of that of the other, e.g.,~when the one increases the other decreases.
Zero means that the variables are not linked and are \emph{independent} of each other.
A positive correlation indicates that the variables behave in a \emph{similar} manner, e.g.,~when the one decreases the other decreases as well.

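As a small numerical illustration of this characterization, the following Python sketch computes Pearson's correlation coefficient for two toy series; the data are made up for illustration.

    import numpy as np

    def pearson(x, y):
        """Pearson's correlation coefficient between two equally long series."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        zx, zy = x - x.mean(), y - y.mean()
        return np.sum(zx * zy) / np.sqrt(np.sum(zx ** 2) * np.sum(zy ** 2))

    x = [1, 2, 3, 4, 5]
    print(pearson(x, [2, 4, 6, 8, 10]))   # +1.0: perfectly positive correlation
    print(pearson(x, [10, 8, 6, 4, 2]))   # -1.0: perfectly negative correlation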

\subsection{Privacy loss under temporal correlation}

% The presence of temporal correlations might result into additional privacy loss consisting of \emph{backward privacy loss} $\alpha^B$ and \emph{forward privacy loss} $\alpha^F$~\cite{cao2017quantifying}.
Cao et al.~\cite{cao2017quantifying} propose a method for computing the temporal privacy loss (TPL) of a differential privacy mechanism in the presence of temporal correlations and background knowledge.
The goal of their technique is to guarantee privacy protection and to bound the privacy loss at every timestamp under the assumption of independent data releases.
It calculates the temporal privacy loss as the sum of the backward and forward privacy loss, minus the default privacy loss $\varepsilon$ of the mechanism (because it is counted twice in the aforementioned terms).
This calculation is done for each individual that is included in the original data set, and the overall temporal privacy loss is equal to the maximum calculated value at every timestamp.
The backward/forward privacy loss at any timestamp depends on the backward/forward privacy loss at the previous/next timestamp, the backward/forward temporal correlations, and $\varepsilon$.

\begin{definition}
[Temporal privacy loss (TPL)]
\label{def:tpl}
The potential privacy loss of a privacy mechanism at a timestamp $t \leq T$ due to a series of outputs $\pmb{o}_1$, \dots, $\pmb{o}_T$ and temporal correlations in its input $D_t$, with respect to any adversary targeting an individual with potential data items $x_t$ (or $x'_t$) and having knowledge $\mathbb{D}_t$ equal to $D_t - \{x_t\}$ (or $D'_t - \{x'_t\}$), is defined as:

\begin{equation}
\label{eq:tpl}
\alpha_t = \sup_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_T} \ln \frac{\Pr[\pmb{o}_1, \dots, \pmb{o}_T | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_1, \dots, \pmb{o}_T | x'_t, \mathbb{D}_t]}
\end{equation}
\end{definition}
%
By analyzing Equation~\ref{eq:tpl} we get the following:

\begin{align}
\label{eq:tpl-1}
(\ref{eq:tpl}) = & \underbrace{\sup_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_1, \dots, \pmb{o}_t| x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_1, \dots, \pmb{o}_t | x'_t, \mathbb{D}_t]}}_{\text{Backward privacy loss}\ (\alpha^B_t)} \nonumber \\
& + \underbrace{\sup_{x_t, x'_t, \pmb{o}_t, \dots, \pmb{o}_T} \ln \frac{\Pr[\pmb{o}_t, \dots, \pmb{o}_T| x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t, \dots, \pmb{o}_T | x'_t, \mathbb{D}_t]}}_{\text{Forward privacy loss}\ (\alpha^F_t)} \nonumber \\
& - \underbrace{\sup_{x_t, x'_t, \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t] }{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}}_{\text{Present privacy loss}\ (\varepsilon_t)}
\end{align}

\begin{definition}
[Backward privacy loss (BPL)]
\label{def:bpl}
The potential privacy loss of a privacy mechanism at a timestamp $t \leq T$ due to outputs $\pmb{o}_1$, \dots, $\pmb{o}_t$ and temporal correlations in its input $D_t$, with respect to any adversary targeting an individual with potential data items $x_t$ (or $x'_t$) and having knowledge $\mathbb{D}_t$ equal to $D_t - \{x_t\}$ (or $D'_t - \{x'_t\}$), is called backward privacy loss and is defined as:

\begin{equation}
\label{eq:bpl-1}
\alpha^B_t = \sup_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_1, \dots, \pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_1, \dots, \pmb{o}_t | x'_t, \mathbb{D}_t]}
\end{equation}

\end{definition}
%
From differential privacy we have the assumption that $\pmb{o}_1$, \dots, $\pmb{o}_t$ are independent events.
Therefore, according to Bayes' theorem, we can write Equation~\ref{eq:bpl-1} as:

\begin{align}
\label{eq:bpl-2}
(\ref{eq:bpl-1}) = & \sup_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1}| x_t, \mathbb{D}_t] \Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x'_t, \mathbb{D}_t] \Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]} \nonumber \\
= & \sup_{x_t, x_t', \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \frac{\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_t', \mathbb{D}_t]} \nonumber \\
& + \sup_{x_t, x_t', \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}
\end{align}
%
Applying the law of total probability to the first term of Equation~\ref{eq:bpl-2}, for all the possible data $x_{t - 1}$ (or $x'_{t - 1}$) and $\mathbb{D}_{t - 1}$, we get the following:

\begin{align}
\label{eq:bpl-3}
(\ref{eq:bpl-2}) = &
\adjustbox{max width=0.9\linewidth}{
$\sup\limits_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \frac{\sum\limits_{x_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_t, \mathbb{D}_t, x_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x_{t - 1}, \mathbb{D}_{t - 1} | x_t, \mathbb{D}_t]}{\sum\limits_{x'_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x'_t, \mathbb{D}_t, x'_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x'_{t - 1}, \mathbb{D}_{t - 1} | x'_t, \mathbb{D}_t]}$
} \nonumber \\
& \adjustbox{max width=0.3\linewidth}{
$+ \sup\limits_{x_t, x_t', \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}$
}
\end{align}
%
Since $\mathbb{D}_t$ is equal to $D_t - \{x_t\}$ (or $D'_t - \{x'_t\}$), and thus is constant and independent of every possible $x_t$ (or $x'_t$), $\forall t \leq T$, Equation~\ref{eq:bpl-3} can be written as:

\begin{align}
\label{eq:bpl-4}
(\ref{eq:bpl-3}) = &
\adjustbox{max width=0.9\linewidth}{
$\sup\limits_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \frac{\sum\limits_{x_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_t, \mathbb{D}_t, x_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x_{t - 1} | x_t, \mathbb{D}_t] \Pr[\mathbb{D}_{t - 1} | x_t, \mathbb{D}_t]}{\sum\limits_{x'_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x'_t, \mathbb{D}_t, x'_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x'_{t - 1} | x'_t, \mathbb{D}_t] \Pr[\mathbb{D}_{t - 1} | x'_t, \mathbb{D}_t]}$
} \nonumber \\
& \adjustbox{max width=0.275\linewidth}{
$+ \sup\limits_{x_t, x_t', \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}$
} \nonumber \\
= & \adjustbox{max width=0.825\linewidth}{
$\sup\limits_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \frac{\sum\limits_{x_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_t, \mathbb{D}_t, x_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x_{t - 1} | x_t] \Pr[\mathbb{D}_{t - 1} | \mathbb{D}_t]}{\sum\limits_{x'_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x'_t, \mathbb{D}_t, x'_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x'_{t - 1} | x'_t] \Pr[\mathbb{D}_{t - 1} | \mathbb{D}_t]}$
} \nonumber \\
& \adjustbox{max width=0.275\linewidth}{
$+ \sup\limits_{x_t, x_t', \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}$
} \nonumber \\
= & \adjustbox{max width=0.7\linewidth}{
$\sup\limits_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \frac{\sum\limits_{x_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_t, \mathbb{D}_t, x_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x_{t - 1} | x_t]}{\sum\limits_{x'_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x'_t, \mathbb{D}_t, x'_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x'_{t - 1} | x'_t]}$
} \nonumber \\
& \adjustbox{max width=0.275\linewidth}{
$+ \sup\limits_{x_t, x_t', \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}$
}
\end{align}
%
The outputs $\pmb{o}_1$, \dots, $\pmb{o}_t$ and $x_t$ (or $x'_t$) are conditionally independent in the presence of $x_{t - 1}$ (or $x'_{t - 1}$), and thus Equation~\ref{eq:bpl-4} can be written as:

\begin{align}
\label{eq:bpl-5}
(\ref{eq:bpl-4}) = &
\adjustbox{max width=0.9\linewidth}{
$\sup\limits_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \cfrac{\sum\limits_{x_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x_{t - 1} | x_t]}{\sum\limits_{x'_{t - 1}} \underbrace{\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x'_{t - 1}, \mathbb{D}_{t - 1}]}_{\alpha^B_{t - 1}} \underbrace{\Pr[x'_{t - 1} | x'_t]}_{P^B_{t - 1}}}$
} \nonumber \\
& \adjustbox{max width=0.4\linewidth}{
$+ \underbrace{\sup\limits_{x_t, x'_t, \pmb{o}_t} \ln \cfrac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}}_{\varepsilon_t}$
}
\end{align}

\begin{definition}
[Forward privacy loss (FPL)]
\label{def:fpl}
The potential privacy loss of a privacy mechanism at a timestamp $t \leq T$ due to outputs $\pmb{o}_t$, \dots, $\pmb{o}_T$ and temporal correlations in its input $D_t$, with respect to any adversary targeting an individual with potential data item $x_t$ (or $x'_t$) and having knowledge $\mathbb{D}_t$ equal to $D_t - \{x_t\}$ (or $D'_t - \{x'_t\}$), is called forward privacy loss and is defined as:

\begin{equation}
\label{eq:fpl-1}
\alpha^F_t = \sup_{x_t, x'_t, \pmb{o}_t, \dots, \pmb{o}_T} \ln \frac{\Pr[\pmb{o}_t, \dots, \pmb{o}_T | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t, \dots, \pmb{o}_T | x'_t, \mathbb{D}_t]}
\end{equation}
\end{definition}
%
Similarly to the way in which we obtained Equation~\ref{eq:bpl-5} from Equation~\ref{eq:bpl-1}, we can write Equation~\ref{eq:fpl-1} as follows:

\begin{align}
\label{eq:fpl-2}
(\ref{eq:fpl-1}) = &
\adjustbox{max width=0.9\linewidth}{
$\sup\limits_{x_t, x'_t, \pmb{o}_{t + 1}, \dots, \pmb{o}_T} \ln \cfrac{\sum\limits_{x_{t + 1}} \Pr[\pmb{o}_{t + 1}, \dots, \pmb{o}_T | x_{t + 1}, \mathbb{D}_{t + 1}] \Pr[x_{t + 1} | x_t]}{\sum\limits_{x'_{t + 1}} \underbrace{\Pr[\pmb{o}_{t + 1}, \dots, \pmb{o}_T | x'_{t + 1}, \mathbb{D}_{t + 1}]}_{\alpha^F_{t + 1}} \underbrace{\Pr[x'_{t + 1} | x'_t]}_{P^F_{t + 1}}}$
} \nonumber \\
& \adjustbox{max width=0.4\linewidth}{
$+ \underbrace{\sup\limits_{x_t, x'_t, \pmb{o}_t} \ln \cfrac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}}_{\varepsilon_t}$
}
\end{align}

Equations~\ref{eq:tpl-1},~\ref{eq:bpl-5}, and~\ref{eq:fpl-2} apply to the global publishing scheme.
In the local scheme, $D$ (or $D'$) is a single data item and is the same as $x$ (or $x'$), i.e.,~the possible data item of an individual user.
Therefore, we calculate the extra privacy loss under temporal correlations, due to an adversary that targets a user at a timestamp $t$, based on the assumption that their possible data are $D_t$ or $D'_t$.
More specifically, the calculation of TPL (Equation~\ref{eq:tpl-1}) becomes:
\begin{align}
\label{eq:tpl-local}
& \underbrace{\sup_{D_t, D'_t, \pmb{o}_1, \dots, \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_1, \dots, \pmb{o}_t| D_t]}{\Pr[\pmb{o}_1, \dots, \pmb{o}_t | D'_t]}}_{\text{Backward privacy loss}\ (\alpha^B_t)} \nonumber \\
& + \underbrace{\sup_{D_t, D'_t, \pmb{o}_t, \dots, \pmb{o}_T} \ln \frac{\Pr[\pmb{o}_t, \dots, \pmb{o}_T| D_t]}{\Pr[\pmb{o}_t, \dots, \pmb{o}_T | D'_t]}}_{\text{Forward privacy loss}\ (\alpha^F_t)} \nonumber \\
& - \underbrace{\sup_{D_t, D'_t, \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | D_t]}{\Pr[\pmb{o}_t | D'_t]}}_{\text{Present privacy loss}\ (\varepsilon_t)}
\end{align}
%
The calculation of BPL (Equation~\ref{eq:bpl-5}) becomes:
\begin{align}
\label{eq:bpl-local}
& \adjustbox{max width=0.9\linewidth}{
$\sup\limits_{D_t, D'_t, \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \cfrac{\sum\limits_{D_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | D_{t - 1}] \Pr[D_{t - 1} | D_t]}{\sum\limits_{D'_{t - 1}} \underbrace{\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | D'_{t - 1}]}_{\alpha^B_{t - 1}} \underbrace{\Pr[D'_{t - 1} | D'_t]}_{P^B_{t - 1}}}$
} \nonumber \\
& \adjustbox{max width=0.4\linewidth}{
$+ \underbrace{\sup\limits_{D_t, D'_t, \pmb{o}_t} \ln \cfrac{\Pr[\pmb{o}_t | D_t]}{\Pr[\pmb{o}_t | D'_t]}}_{\varepsilon_t}$
}
\end{align}
%
The calculation of FPL (Equation~\ref{eq:fpl-2}) becomes:
\begin{align}
\label{eq:fpl-local}
& \adjustbox{max width=0.9\linewidth}{
$\sup\limits_{D_t, D'_t, \pmb{o}_{t + 1}, \dots, \pmb{o}_T} \ln \cfrac{\sum\limits_{D_{t + 1}} \Pr[\pmb{o}_{t + 1}, \dots, \pmb{o}_T | D_{t + 1}] \Pr[D_{t + 1} | D_t]}{\sum\limits_{D'_{t + 1}} \underbrace{\Pr[\pmb{o}_{t + 1}, \dots, \pmb{o}_T | D'_{t + 1}]}_{\alpha^F_{t + 1}} \underbrace{\Pr[D'_{t + 1} | D'_t]}_{P^F_{t + 1}}}$
} \nonumber \\
& \adjustbox{max width=0.4\linewidth}{
$+ \underbrace{\sup\limits_{D_t, D'_t, \pmb{o}_t} \ln \cfrac{\Pr[\pmb{o}_t | D_t]}{\Pr[\pmb{o}_t | D'_t]}}_{\varepsilon_t}$
}
\end{align}
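To make the recurrence of Equation~\ref{eq:bpl-local} concrete, the following Python sketch evaluates the backward privacy loss for a small discrete domain, assuming a known backward transition matrix (which could be estimated with a Markov chain as sketched earlier) and a fixed per-timestamp budget. It is only an illustrative brute-force sketch, not the optimized algorithm of Cao et al.~\cite{cao2017quantifying}: it exploits the fact that the log-ratio in the recurrence is a linear-fractional function of the (bounded) previous-step probabilities, so its supremum is attained at extreme values, which can be enumerated for a handful of states.

    import itertools
    import numpy as np

    def next_bpl(prev_bpl, P, eps):
        """One step of the backward privacy loss recurrence (local scheme).

        prev_bpl: backward privacy loss at timestamp t-1 (alpha^B_{t-1})
        P:        backward transition matrix, P[i, j] = Pr[D_{t-1} = j | D_t = i]
        eps:      privacy budget epsilon_t of the mechanism at timestamp t
        """
        n = P.shape[0]
        best = 0.0
        # Pr[o_1..o_{t-1} | D_{t-1}] may vary by at most exp(prev_bpl) across
        # values of D_{t-1}; the supremum of the log-ratio of the two weighted
        # sums is attained when each such probability sits at one of the two
        # extremes, so we enumerate all 2^n extreme assignments.
        for vertex in itertools.product([1.0, np.exp(prev_bpl)], repeat=n):
            w = np.array(vertex)
            sums = P @ w                      # one weighted sum per value of D_t
            best = max(best, np.log(sums.max() / sums.min()))
        return best + eps

    # Toy example: 3 possible locations, a made-up backward transition matrix,
    # and a per-timestamp budget of eps = 0.1 over 5 releases.
    P_B = np.array([[0.8, 0.1, 0.1],
                    [0.2, 0.6, 0.2],
                    [0.1, 0.1, 0.8]])
    eps = 0.1
    bpl = 0.0
    for t in range(5):
        bpl = next_bpl(bpl, P_B, eps)
        print(f"t={t + 1}: backward privacy loss = {bpl:.3f}")

The forward privacy loss can be computed symmetrically with the forward transition matrix $\Pr[D_{t + 1} | D_t]$, iterating from the last timestamp backwards, and the total loss at $t$ is then $\alpha_t = \alpha^B_t + \alpha^F_t - \varepsilon_t$, as in Equation~\ref{eq:tpl-local}.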

The authors propose solutions to bound the temporal privacy loss, in the presence of weak to moderate correlations, in both finite and infinite data publishing scenarios.
In the latter case, they try to find a value for $\varepsilon$ for which the backward and forward privacy loss are equal.
In the former, they similarly try to balance the backward and forward privacy loss, while allocating more $\varepsilon$ to the first and last timestamps, since these have a higher impact on the privacy loss of the next and previous ones.
This way, they achieve an overall constant temporal privacy loss throughout the time series.

According to the technique's intuition, stronger correlations result in higher privacy loss.
However, the loss is smaller when the dimension of the transition matrix, which is extracted according to the modeling of the correlations (in this work, Markov chains), is larger, because larger transition matrices tend to be more uniform, resulting in weaker data dependence.
The authors briefly investigate all of the possible privacy levels; however, the solutions that they propose are applied only at the event level.
Last but not least, the technique requires the calculation of the temporal privacy loss for every individual within the data set, which might prove computationally inefficient in real-time scenarios.
@@ -1,11 +1,10 @@
-\section{Data}
+\section{Types of data sets}
\label{sec:data}

-\subsection{Data categories}
+\subsection{Categories}
\label{subsec:data-categories}

-As this survey is about privacy, the data that we are interested in, contain information about individuals and their actions.
+The data that we are interested in contain information about individuals and their actions.
We first classify the data based on their content:

\begin{itemize}
@@ -28,58 +27,8 @@ Depending on the span of observation, we distinguish the following categories:
The two data tables, over the time-span $[t_1, t_2]$, are an example of finite data.
Infinite data are the whole series of data obtained over the period~$[t_1, \infty)$ (infinity is denoted by `\dots').

-\begin{table}
+\includetable{continuous}
(The remaining removed lines of this hunk delete the table environment from this file; its content is identical to the new tables/continuous.tex shown above.)
\end{example}

We further define two sub-categories applicable to both finite and infinite data: \emph{sequential} and \emph{incremental} data; these two sub-categories are not exhaustive, i.e.,~not all data sets belong to the one or the other category.
@@ -89,7 +38,7 @@ In incremental data, an original data set is augmented in each subsequent timest
For example, trajectories can be considered as incremental data, when at each timestamp we consider all the locations previously visited by an individual, incremented by their current position.
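A minimal illustration of the incremental representation in Python (the visited locations are borrowed from the running example, the rest is hypothetical):

    # Locations visited by an individual, one per timestamp (from the running example).
    visited = ["Le Marais", "Montmartre", "Opera"]

    # The trajectory as incremental data: at each timestamp the data set contains
    # all previously visited locations, incremented by the current position.
    incremental = [visited[:t + 1] for t in range(len(visited))]
    # [['Le Marais'], ['Le Marais', 'Montmartre'], ['Le Marais', 'Montmartre', 'Opera']]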

-\subsection{Processing and publishing}
+\subsection{Data processing and publishing}
\label{subsec:data-publishing}

We categorize data processing and publishing based on the implemented scheme, as:
@@ -125,7 +74,7 @@ Nonetheless, data distortion at an early stage might prove detrimental to the ov
The consensus so far is that there is no overall optimal solution between the two designs.
Most service-providing companies prefer the global scheme, mainly for reasons of better management and control over the data, while several privacy advocates support the local privacy scheme, which offers users full control over what and how data are published.
Although there have been attempts to bridge the gap between them, e.g.,~\cite{bittau2017prochlo}, the global scheme is considerably better explored and implemented~\cite{satyanarayanan2017emergence}.
-For this reason, most of the works in this survey span this context.
+For this reason, most of the works that we review span this context.

We distinguish between two publishing modes for private data: \emph{snapshot} and \emph{continuous}.
In snapshot publishing (also appearing as \emph{one-shot} or \emph{one-off} publishing), the system processes and releases a data set at a specific point in time and is thereafter no longer concerned with the specific data set.
@@ -133,7 +82,7 @@ For example, in Figure~\ref{fig:mode-snapshot} (ignore the privacy-preserving st
In continuous data publishing, the system computes and publishes augmented or updated versions of one data set at different points in time, without a predefined duration.
In the context of privacy-preserving data publishing, privacy preservation is tightly coupled with the data processing and publishing stages.

-As already discussed in Section~\ref{ch:intro}, in this survey we are studying the continuous data publishing mode, and thus we do not include works considering the snapshot paradigm.
+As already discussed in Chapter~\ref{ch:intro}, in this work we are studying the continuous data publishing mode, and thus we do not include works considering the snapshot paradigm.
We make this deliberate choice as privacy-preserving continuous data publishing is a more complex problem, receiving more and more attention from the scientific community in recent years, as shown by the increasing number of publications in this area.
Moreover, the use cases of continuous data publishing abound, with the proliferation of the Internet, sensors, and connected devices, which produce and send to servers huge amounts of continuous personal data at astounding speed.
@@ -4,21 +4,9 @@
In this chapter, we introduce some relevant terminology and background knowledge around the problem of continuous publishing of sensitive data sets.
First, we categorize data as we view them in the context of continuous data publishing.
Second, we define data privacy, and we list the kinds of attacks that have been identified in the literature, the desired privacy levels that can be achieved, and the basic privacy operations that are applied to achieve data privacy.
-Third, we provide a brief overview of the seminal works on privacy-preserving data publishing, used also in continuous data publishing, fundamental in the domain and important for the understanding of the rest of the survey.
+Third, we provide a brief overview of the seminal works on privacy-preserving data publishing, used also in continuous data publishing, fundamental in the domain and important for the understanding of the rest of the chapter.
(The remaining removed lines of this hunk delete the running example block, i.e.,~Example~\ref{ex:snapshot} together with \includetable{snapshot}, which now appears in the introduction as shown above.)

\input{preliminaries/data}
\input{preliminaries/privacy}
+\input{preliminaries/correlation}
\input{preliminaries/summary}
@@ -1,4 +1,4 @@
-\section{Privacy}
+\section{Data privacy}
\label{sec:privacy}

When personal data are publicly released, either as microdata or statistical data, individuals' privacy can be compromised, i.e.,~an adversary becomes certain about an individual's personal information with a probability higher than a desired threshold.
@@ -19,7 +19,7 @@ Identity disclosure appears when we can guess that the sixth record of (a privac
Attribute disclosure appears when it is revealed from (a privacy-protected version of) the microdata of Table~\ref{tab:snapshot-micro} that Quackmore is $62$ years old.

-\subsection{Levels}
+\subsection{Levels of privacy protection}
\label{subsec:prv-levels}

The information disclosure that a data release may entail is linked to the protection level that indicates \emph{what} a privacy-preserving algorithm is trying to achieve.
@@ -64,62 +64,22 @@ In the extreme cases where $w$ is equal to either $1$ or to the size of the enti
Although the described levels have been coined in the context of \emph{differential privacy}~\cite{dwork2006calibrating}, a seminal privacy method that we will discuss in more detail in Section~\ref{subsec:prv-statistical}, it is possible to apply their definitions to other privacy protection techniques as well.

-\subsection{Attacks}
+\subsection{Attacks on privacy}
\label{subsec:prv-attacks}

Information disclosure is typically achieved by combining supplementary (background) knowledge with the released data or by setting unrealistic assumptions while designing the privacy-preserving algorithms.
In its general form, this is known as an \emph{adversarial} or \emph{linkage} attack.
Even though many works directly refer to the general category of linkage attacks, we also distinguish the following sub-categories, addressed in the literature:

-\begin{itemize}
-\item \emph{Sensitive attribute domain} knowledge.
+\paragraph{Sensitive attribute domain} knowledge.
Here we can identify \emph{homogeneity and skewness} attacks~\cite{machanavajjhala2006diversity,li2007t}, when statistics of the sensitive attribute values are available, and the \emph{similarity} attack, when semantics of the sensitive attribute values are available.

-\item \emph{Complementary release} attacks~\cite{sweeney2002k} with regard to previous releases of different versions of the same and/or related data sets.
+\paragraph{Complementary release} attacks~\cite{sweeney2002k} with regard to previous releases of different versions of the same and/or related data sets.
In this category, we also identify the \emph{unsorted matching} attack~\cite{sweeney2002k}, which is achieved when two privacy-protected versions of an original data set are published in the same tuple ordering.
Other instances include: (i)~the \emph{join} attack~\cite{wang2006anonymizing}, when tuples can be identified by joining (on the (quasi-)identifiers) several releases, (ii)~the \emph{tuple correspondence} attack~\cite{fung2008anonymity}, when in case of incremental data certain tuples correspond to certain tuples in other releases, in an injective way, (iii)~the \emph{tuple equivalence} attack~\cite{he2011preventing}, when tuples among different releases are found to be equivalent with respect to the sensitive attribute, and (iv)~the \emph{unknown releases} attack~\cite{shmueli2015privacy}, when the privacy preservation is performed without knowing the previously privacy-protected data sets.

-\item \emph{Data dependence}
+\paragraph{Data dependence} either within one data set, or among one data set and previous data releases, and/or other external sources~\cite{kifer2011no, chen2014correlated, liu2016dependence, zhao2017dependent}.
+We will look into this category in more detail later, in Section~\ref{sec:correlation}.
(The remaining removed lines of this hunk delete the former in-place discussion of data dependence, correlation types, and stochastic process modeling, together with the closing \end{itemize} of the attack list; that material now appears, reorganized, in the new Section~\ref{sec:correlation} shown above.)

The first sub-category of attacks has been mainly addressed in works on snapshot microdata publishing, and is still present in continuous publishing; however, algorithms for continuous publishing typically adopt the solutions proposed for the snapshot publishing scheme (see the discussion of $k$-anonymity and $l$-diversity in Section~\ref{subsec:prv-seminal}).
This kind of attack is tightly coupled with publishing the (privacy-protected) sensitive attribute value.
@@ -131,7 +91,7 @@ By the data dependence attack, the status of Donald could be more certainly infe
In order to better protect the privacy of Donald in case of attacks, the data should be privacy-protected in a more adequate way (than without the attacks).

-\subsection{Operations}
+\subsection{Privacy-preserving operations}
\label{subsec:prv-operations}

Protecting private information, which is known by many names (obfuscation, cloaking, anonymization, etc.), is achieved by using a specific basic privacy protection operation.
@@ -156,7 +116,7 @@ Our focus is limited to techniques that achieve a satisfying balance between bot
For these reasons, there will be no further discussion around this family of techniques in this article.

-\subsection{Seminal works}
+\subsection{Seminal works in privacy protection}
\label{subsec:prv-seminal}
\kat{Seminal works fit best in the related work section}
@@ -315,174 +275,6 @@ When the users consider recent data releases more privacy sensitive than distant
A set of $m \in \mathbb{Z}^+$ independent privacy mechanisms, satisfying $\varepsilon_1$-, $\varepsilon_2$-, \dots, $\varepsilon_m$-differential privacy respectively, satisfies $\sum_{i = 1}^m g(i) \varepsilon_i$-differential privacy for a discount function $g$.
\end{theorem}
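For instance (an illustrative instantiation, not taken from the cited works), with an exponential discount function $g(i) = 2^{-(i - 1)}$ and a uniform budget $\varepsilon_i = \varepsilon$, the composition yields $\sum_{i = 1}^m 2^{-(i - 1)} \varepsilon < 2\varepsilon$, i.e.,~the overall privacy loss stays bounded regardless of the number of releases $m$.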
|
|
||||||
% The presence of temporal correlations might result into additional privacy loss consisting of \emph{backward privacy loss} $\alpha^B$ and \emph{forward privacy loss} $\alpha^F$~\cite{cao2017quantifying}.
|
|
||||||
Cao et al.~\cite{cao2017quantifying} propose a method for computing the temporal privacy loss (TPL) of a differential privacy mechanism in the presence of temporal correlations and background knowledge.
|
|
||||||
The goal of their technique is to guarantee privacy protection and to bound the privacy loss at every timestamp under the assumption of independent data releases.
|
|
||||||
It calculates the temporal privacy loss as the sum of the backward and forward privacy loss minus the default privacy loss $\varepsilon$ of the mechanism (because it is counted twice in the aforementioned entities).
|
|
||||||
This calculation is done for each individual that is included in the original data set and the overall temporal privacy loss is equal to the maximum calculated value at every timestamp.
|
|
||||||
The backward/forward privacy loss at any timestamp depends on the backward/forward privacy loss at the previous/next timestamp, the backward/forward temporal correlations, and $\varepsilon$.
|
|
||||||
|
|
||||||
\begin{definition}
|
|
||||||
[Temporal privacy loss (TPL)]
|
|
||||||
\label{def:tpl}
|
|
||||||
The potential privacy loss of a privacy mechanism at a timestamp $t \leq T$ due to a series of outputs $\pmb{o}_1$, \dots, $\pmb{o}_T$ and temporal correlations in its input $D_t$ with respect to any adversary, targeting an individual with potential data items $x_t$ (or $x'_t$) and having knowledge $\mathbb{D}_t$ equal to $D_t - \{x_t\}$ (or $D'_t - \{x'_t\}$), is defined as:
|
|
||||||
|
|
||||||
\begin{equation}
|
|
||||||
\label{eq:tpl}
|
|
||||||
\alpha_t = \sup_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_T} \ln \frac{\Pr[\pmb{o}_1, \dots, \pmb{o}_T | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_1, \dots, \pmb{o}_T | x'_t, \mathbb{D}_t]}
|
|
||||||
\end{equation}
|
|
||||||
\end{definition}
|
|
||||||
%
|
|
||||||
By analyzing Equation~\ref{eq:tpl} we get the following:
|
|
||||||
|
|
||||||
\begin{align}
|
|
||||||
\label{eq:tpl-1}
|
|
||||||
(\ref{eq:tpl}) = & \underbrace{\sup_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_1, \dots, \pmb{o}_t| x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_1, \dots, \pmb{o}_t | x'_t, \mathbb{D}_t]}}_{\text{Backward privacy loss}\ (\alpha^B_t)} \nonumber \\
|
|
||||||
& + \underbrace{\sup_{x_t, x'_t, \pmb{o}_t, \dots, \pmb{o}_T} \ln \frac{\Pr[\pmb{o}_t, \dots, \pmb{o}_T| x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t, \dots, \pmb{o}_T | x'_t, \mathbb{D}_t]}}_{\text{Forward privacy loss}\ (\alpha^F_t)} \nonumber \\
|
|
||||||
& - \underbrace{\sup_{x_t, x'_t, \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t] }{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}}_{\text{Present privacy loss}\ (\varepsilon_t)}
|
|
||||||
\end{align}
|
|
||||||
|
|
||||||
\begin{definition}
|
|
||||||
[Backward privacy loss (BPL)]
|
|
||||||
\label{def:bpl}
|
|
||||||
The potential privacy loss of a privacy mechanism at a timestamp $t \leq T$ due to outputs $\pmb{o}_1$, \dots, $\pmb{o}_t$ and temporal correlations in its input $D_t$ with respect to any adversary, targeting an individual with potential data items $x_t$ (or $x'_t$) and having knowledge $\mathbb{D}_t$ equal to $D_t - \{x_t\}$ (or $D'_t - \{x'_t\}$), is called backward privacy loss and is defined as:
|
|
||||||
|
|
||||||
\begin{equation}
|
|
||||||
\label{eq:bpl-1}
|
|
||||||
\alpha^B_t = \sup_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_1, \dots, \pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_1, \dots, \pmb{o}_t | x'_t, \mathbb{D}_t]}
|
|
||||||
\end{equation}
|
|
||||||
|
|
||||||
\end{definition}
%
In differential privacy, each output is generated independently at its respective timestamp; hence, $\pmb{o}_1$, \dots, $\pmb{o}_t$ are treated as independent events.
Therefore, applying the product rule of probability, we can write Equation~\ref{eq:bpl-1} as:

\begin{align}
\label{eq:bpl-2}
(\ref{eq:bpl-1}) = & \sup_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1}| x_t, \mathbb{D}_t] \Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x'_t, \mathbb{D}_t] \Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]} \nonumber \\
= & \sup_{x_t, x_t', \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \frac{\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_t', \mathbb{D}_t]} \nonumber \\
& + \sup_{x_t, x_t', \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}
\end{align}

%
Applying the law of total probability to the first term of Equation~\ref{eq:bpl-2}, for all the possible data $x_{t - 1}$ (or $x'_{t - 1}$) and $\mathbb{D}_{t - 1}$, we get the following:

\begin{align}
\label{eq:bpl-3}
(\ref{eq:bpl-2}) = &
\adjustbox{max width=0.9\linewidth}{
$\sup\limits_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \frac{\sum\limits_{x_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_t, \mathbb{D}_t, x_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x_{t - 1}, \mathbb{D}_{t - 1} | x_t, \mathbb{D}_t]}{\sum\limits_{x'_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x'_t, \mathbb{D}_t, x'_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x'_{t - 1}, \mathbb{D}_{t - 1} | x'_t, \mathbb{D}_t]}$
} \nonumber \\
& \adjustbox{max width=0.3\linewidth}{
$+ \sup\limits_{x_t, x_t', \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}$
}
\end{align}

%
Since $\mathbb{D}_t$ is equal to $D_t - \{x_t\}$ (or $D'_t - \{x'_t\}$), and thus is constant and independent of every possible $x_t$ (or $x'_t$), $\forall t \leq T$, Equation~\ref{eq:bpl-3} can be written as:

\begin{align}
\label{eq:bpl-4}
(\ref{eq:bpl-3}) = &
\adjustbox{max width=0.9\linewidth}{
$\sup\limits_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \frac{\sum\limits_{x_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_t, \mathbb{D}_t, x_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x_{t - 1} | x_t, \mathbb{D}_t] \Pr[\mathbb{D}_{t - 1} | x_t, \mathbb{D}_t]}{\sum\limits_{x'_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x'_t, \mathbb{D}_t, x'_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x'_{t - 1} | x'_t, \mathbb{D}_t] \Pr[\mathbb{D}_{t - 1} | x'_t, \mathbb{D}_t]}$
} \nonumber \\
& \adjustbox{max width=0.275\linewidth}{
$+ \sup\limits_{x_t, x_t', \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}$
} \nonumber \\
= & \adjustbox{max width=0.825\linewidth}{
$\sup\limits_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \frac{\sum\limits_{x_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_t, \mathbb{D}_t, x_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x_{t - 1} | x_t] \Pr[\mathbb{D}_{t - 1} | \mathbb{D}_t]}{\sum\limits_{x'_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x'_t, \mathbb{D}_t, x'_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x'_{t - 1} | x'_t] \Pr[\mathbb{D}_{t - 1} | \mathbb{D}_t]}$
} \nonumber \\
& \adjustbox{max width=0.275\linewidth}{
$+ \sup\limits_{x_t, x_t', \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}$
} \nonumber \\
= & \adjustbox{max width=0.7\linewidth}{
$\sup\limits_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \frac{\sum\limits_{x_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_t, \mathbb{D}_t, x_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x_{t - 1} | x_t]}{\sum\limits_{x'_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x'_t, \mathbb{D}_t, x'_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x'_{t - 1} | x'_t]}$
} \nonumber \\
& \adjustbox{max width=0.275\linewidth}{
$+ \sup\limits_{x_t, x_t', \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}$
}
\end{align}

%
The outputs $\pmb{o}_1$, \dots, $\pmb{o}_{t - 1}$ and $x_t$ (or $x'_t$) are conditionally independent given $x_{t - 1}$ and $\mathbb{D}_{t - 1}$ (or $x'_{t - 1}$ and $\mathbb{D}_{t - 1}$), and thus Equation~\ref{eq:bpl-4} can be written as:

\begin{align}
\label{eq:bpl-5}
(\ref{eq:bpl-4}) = &
\adjustbox{max width=0.9\linewidth}{
$\sup\limits_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \cfrac{\sum\limits_{x_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x_{t - 1} | x_t]}{\sum\limits_{x'_{t - 1}} \underbrace{\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x'_{t - 1}, \mathbb{D}_{t - 1}]}_{\alpha^B_{t - 1}} \underbrace{\Pr[x'_{t - 1} | x'_t]}_{P^B_{t - 1}}}$
} \nonumber \\
& \adjustbox{max width=0.4\linewidth}{
$+ \underbrace{\sup\limits_{x_t, x'_t, \pmb{o}_t} \ln \cfrac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}}_{\varepsilon_t}$
}
\end{align}
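%
As an illustration of how this recursion can be evaluated, assume (purely as a sketch, not part of the original analysis) that the targeted individual's data item takes one of two values $v_1$ and $v_2$, and that the backward temporal correlations are captured by the transition probabilities $p = \Pr[x_{t - 1} = v_1 | x_t = v_1]$ and $q = \Pr[x_{t - 1} = v_1 | x_t = v_2]$, with $p \geq q$.
Maximizing the first term of Equation~\ref{eq:bpl-5} under the constraint that the conditional probabilities $\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_{t - 1}, \mathbb{D}_{t - 1}]$ for the two values of $x_{t - 1}$ may differ by a factor of at most $e^{\alpha^B_{t - 1}}$ gives:

\begin{equation*}
\alpha^B_t = \varepsilon_t + \ln \frac{p e^{\alpha^B_{t - 1}} + (1 - p)}{q e^{\alpha^B_{t - 1}} + (1 - q)}
\end{equation*}
%
The two extremes behave as expected: when $p = q$ (no correlation) the logarithmic term vanishes and $\alpha^B_t = \varepsilon_t$, whereas when $p = 1$ and $q = 0$ (perfect correlation) we get $\alpha^B_t = \varepsilon_t + \alpha^B_{t - 1}$, i.e.,~the losses add up as in sequential composition.
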
\begin{definition}
[Forward privacy loss (FPL)]
\label{def:fpl}
The potential privacy loss of a privacy mechanism at a timestamp $t \leq T$ due to outputs $\pmb{o}_t$, \dots, $\pmb{o}_T$ and temporal correlations in its input $D_t$ with respect to any adversary, targeting an individual with potential data item $x_t$ (or $x'_t$) and having knowledge $\mathbb{D}_t$ equal to $D_t - \{x_t\}$ (or $D'_t - \{x'_t\}$), is called forward privacy loss and is defined as:

\begin{equation}
\label{eq:fpl-1}
\alpha^F_t = \sup_{x_t, x'_t, \pmb{o}_t, \dots, \pmb{o}_T} \ln \frac{\Pr[\pmb{o}_t, \dots, \pmb{o}_T | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t, \dots, \pmb{o}_T | x'_t, \mathbb{D}_t]}
\end{equation}
\end{definition}
%
Similarly to the derivation of Equation~\ref{eq:bpl-5} from Equation~\ref{eq:bpl-1}, we can write Equation~\ref{eq:fpl-1} as follows:

\begin{align}
\label{eq:fpl-2}
(\ref{eq:fpl-1}) = &
\adjustbox{max width=0.9\linewidth}{
$\sup\limits_{x_t, x'_t, \pmb{o}_{t + 1}, \dots, \pmb{o}_T} \ln \cfrac{\sum\limits_{x_{t + 1}} \Pr[\pmb{o}_{t + 1}, \dots, \pmb{o}_T | x_{t + 1}, \mathbb{D}_{t + 1}] \Pr[x_{t + 1} | x_t]}{\sum\limits_{x'_{t + 1}} \underbrace{\Pr[\pmb{o}_{t + 1}, \dots, \pmb{o}_T | x'_{t + 1}, \mathbb{D}_{t + 1}]}_{\alpha^F_{t + 1}} \underbrace{\Pr[x'_{t + 1} | x'_t]}_{P^F_{t + 1}}}$
} \nonumber \\
& \adjustbox{max width=0.4\linewidth}{
$+ \underbrace{\sup\limits_{x_t, x'_t, \pmb{o}_t} \ln \cfrac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}}_{\varepsilon_t}$
}
\end{align}
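%
Under the same illustrative binary-domain assumption as before, with forward transition probabilities $p' = \Pr[x_{t + 1} = v_1 | x_t = v_1]$ and $q' = \Pr[x_{t + 1} = v_1 | x_t = v_2]$ ($p' \geq q'$), the forward recursion takes the symmetric form $\alpha^F_t = \varepsilon_t + \ln \frac{p' e^{\alpha^F_{t + 1}} + (1 - p')}{q' e^{\alpha^F_{t + 1}} + (1 - q')}$, with boundary conditions $\alpha^B_1 = \varepsilon_1$ and $\alpha^F_T = \varepsilon_T$.
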
Equations~\ref{eq:tpl-1},~\ref{eq:bpl-5}, and~\ref{eq:fpl-2} apply to the global publishing schema.
In the local schema, $D$ (or $D'$) is a single data item and coincides with $x$ (or $x'$), i.e.,~the possible data item of an individual user.
Therefore, we calculate the extra privacy loss under temporal correlations, due to an adversary that targets a user at a timestamp $t$, assuming that their possible data item is $D_t$ (or $D'_t$).
More specifically, the calculation of TPL (Equation~\ref{eq:tpl-1}) becomes:
\begin{align}
\label{eq:tpl-local}
& \underbrace{\sup_{D_t, D'_t, \pmb{o}_1, \dots, \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_1, \dots, \pmb{o}_t| D_t]}{\Pr[\pmb{o}_1, \dots, \pmb{o}_t | D'_t]}}_{\text{Backward privacy loss}\ (\alpha^B_t)} \nonumber \\
& + \underbrace{\sup_{D_t, D'_t, \pmb{o}_t, \dots, \pmb{o}_T} \ln \frac{\Pr[\pmb{o}_t, \dots, \pmb{o}_T| D_t]}{\Pr[\pmb{o}_t, \dots, \pmb{o}_T | D'_t]}}_{\text{Forward privacy loss}\ (\alpha^F_t)} \nonumber \\
& - \underbrace{\sup_{D_t, D'_t, \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | D_t]}{\Pr[\pmb{o}_t | D'_t]}}_{\text{Present privacy loss}\ (\varepsilon_t)}
\end{align}
%
The calculation of BPL (Equation~\ref{eq:bpl-5}) becomes:
\begin{align}
\label{eq:bpl-local}
& \adjustbox{max width=0.9\linewidth}{
$\sup\limits_{D_t, D'_t, \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \cfrac{\sum\limits_{D_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | D_{t - 1}] \Pr[D_{t - 1} | D_t]}{\sum\limits_{D'_{t - 1}} \underbrace{\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | D'_{t - 1}]}_{\alpha^B_{t - 1}} \underbrace{\Pr[D'_{t - 1} | D'_t]}_{P^B_{t - 1}}}$
} \nonumber \\
& \adjustbox{max width=0.4\linewidth}{
$+ \underbrace{\sup\limits_{D_t, D'_t, \pmb{o}_t} \ln \cfrac{\Pr[\pmb{o}_t | D_t]}{\Pr[\pmb{o}_t | D'_t]}}_{\varepsilon_t}$
}
\end{align}
%
The calculation of FPL (Equation~\ref{eq:fpl-2}) becomes:
\begin{align}
\label{eq:fpl-local}
& \adjustbox{max width=0.9\linewidth}{
$\sup\limits_{D_t, D'_t, \pmb{o}_{t + 1}, \dots, \pmb{o}_T} \ln \cfrac{\sum\limits_{D_{t + 1}} \Pr[\pmb{o}_{t + 1}, \dots, \pmb{o}_T | D_{t + 1}] \Pr[D_{t + 1} | D_t]}{\sum\limits_{D'_{t + 1}} \underbrace{\Pr[\pmb{o}_{t + 1}, \dots, \pmb{o}_T | D'_{t + 1}]}_{\alpha^F_{t + 1}} \underbrace{\Pr[D'_{t + 1} | D'_t]}_{P^F_{t + 1}}}$
} \nonumber \\
& \adjustbox{max width=0.4\linewidth}{
$+ \underbrace{\sup\limits_{D_t, D'_t, \pmb{o}_t} \ln \cfrac{\Pr[\pmb{o}_t | D_t]}{\Pr[\pmb{o}_t | D'_t]}}_{\varepsilon_t}$
}
\end{align}

The authors propose solutions to bound the temporal privacy loss, in the presence of weak to moderate correlations, in both finite and infinite data publishing scenarios.
In the latter case, they try to find a value for $\varepsilon$ for which the backward and forward privacy loss are equal.
In the former, they similarly try to balance the backward and forward privacy loss while allocating more $\varepsilon$ to the first and last timestamps, since these have a higher impact on the privacy loss of the subsequent and preceding ones, respectively.
In this way, they achieve an overall constant temporal privacy loss throughout the time series.
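As a minimal illustration of this behavior (not part of the original work; the budget and transition probabilities below are made up, and the binary-domain simplification sketched earlier is assumed), the following Python snippet iterates the backward privacy loss recursion under a constant per-timestamp budget $\varepsilon$ and shows that, under such weak correlations, the loss approaches a fixed point instead of growing linearly with time.

\begin{verbatim}
import math

def bpl_next(prev_bpl, eps, p, q):
    # One step of the backward privacy loss recursion for a binary
    # domain: p = Pr[x_{t-1} = v1 | x_t = v1], q = Pr[x_{t-1} = v1 | x_t = v2],
    # with p >= q (illustrative simplification, not the general algorithm).
    num = p * math.exp(prev_bpl) + (1 - p)
    den = q * math.exp(prev_bpl) + (1 - q)
    return eps + math.log(num / den)

eps, p, q = 0.5, 0.8, 0.3  # hypothetical values
bpl = eps                  # at t = 1 only the present loss applies
for t in range(2, 21):
    bpl = bpl_next(bpl, eps, p, q)
    print(f"t = {t:2d}  BPL = {bpl:.4f}")
# The printed sequence climbs from about 0.74 towards a fixed point of
# roughly 0.92, whereas under perfect correlation (p = 1, q = 0) it
# would grow linearly as t * eps.
\end{verbatim}
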
According to the technique's intuition, stronger correlations result in higher privacy loss.
However, the loss is smaller when the dimension of the transition matrix, which is extracted from the modeling of the correlations (in this work, Markov chains), is larger, because larger transition matrices tend to be closer to uniform, resulting in weaker data dependence.
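For instance, using the illustrative binary-domain formula from above with $\alpha^B_{t - 1} = 0.5$, a strongly correlated transition matrix with $p = 0.9$ and $q = 0.1$ adds roughly $0.40$ to the present loss, whereas a nearly uniform one with $p = 0.55$ and $q = 0.45$ adds only about $0.05$ (hypothetical values, shown only to illustrate the trend).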
The authors briefly investigate all of the possible privacy levels; however, the solutions that they propose apply only to event-level privacy.
Last but not least, the technique requires the calculation of the temporal privacy loss for every individual in the data set, which might prove computationally inefficient in real-time scenarios.

When dealing with temporally correlated data, we handle a sequence of $w \leq t \in \mathbb{Z}^+$ mechanisms (indexed by $m \in [1, t]$) as a single entity, where each mechanism contributes to the temporal privacy loss depending on its order of application~\cite{cao2017quantifying}.
The first ($m - 1$ if $w \leq 2$, or $m - w + 1$ if $w > 2$) and last ($m$) mechanisms contribute to the backward and forward temporal privacy loss, respectively.
When $w$ is greater than $2$, the rest of the mechanisms (between $m - w + 2$ and $m - 1$) contribute only to the privacy loss corresponding to the publication of their respective data.