preliminaries: Done

parent 9acae8ed67
commit 487fa06d01
@@ -1,4 +1,4 @@
-\begin{table}
+\begin{figure}
 \centering\noindent\adjustbox{max width=\linewidth} {
 \begin{tabular}{@{}ccc@{}}
 \begin{tabular}{@{}lrll@{}}
@@ -32,5 +32,5 @@
 \end{tabular}%
 }%
 \caption{3-anonymous event-level protected versions of the microdata in Table~\ref{tab:continuous-micro}.}
-\label{tab:scenario-micro}
-\end{table}
+\label{fig:scenario-micro}
+\end{figure}

@@ -1,4 +1,4 @@
-\begin{table}
+\begin{figure}
 \centering
 \subcaptionbox{True counts\label{tab:statistical-true}}{%
 \begin{tabular}{@{}lr@{}}
@@ -6,10 +6,10 @@
 Location & \multicolumn{1}{c@{}}{Count} \\
 \midrule
 Belleville & $1$ \\
-Latin Quarter & $1$ \\
+Quartier Latin & $1$ \\
 Le Marais & $1$ \\
 Montmartre & $2$ \\
-Opera & $1$ \\
+Opéra & $1$ \\
 \bottomrule
 \end{tabular}%
 }\quad
@@ -26,13 +26,13 @@
 Location & \multicolumn{1}{c@{}}{Count} \\
 \midrule
 Belleville & $1$ \\
-Latin Quarter & $0$ \\
+Quartier Latin & $0$ \\
 Le Marais & $2$ \\
 Montmartre & $3$ \\
-Opera & $1$ \\
+Opéra & $1$ \\
 \bottomrule
 \end{tabular}%
 }%
 \caption{(a)~The original version of the data of Table~\ref{tab:continuous-statistical}, and (b)~their $1$-differentially event-level private version.}
-\label{tab:scenario-statistical}
-\end{table}
+\label{fig:scenario-statistical}
+\end{figure}

@@ -1,12 +1,12 @@
 \section{Data correlation}
 \label{sec:correlation}
+% \kat{Please add some introduction to each section, presenting what you will discuss afterwards, and link it somehow to what was already discussed.}
+In this section, we study the most prominent types of correlation, practices for extracting correlation from continuous data, and the privacy risks that correlation entails, with a special emphasis on temporal correlation.
 
-\kat{Please add some introduction to each section, presenting what you will discuss afterwards, and link it somehow to what was already discussed.}
-
 \subsection{Types of correlation}
 \label{subsec:cor-types}
 
-The most prominent types of correlation are:
+The most prominent types of correlation are \emph{temporal}, \emph{spatial}, and \emph{spatiotemporal}.
 
 \begin{itemize}
 \item \emph{Temporal}~\cite{wei2006time}---appearing in observations (i.e.,~values) of the same object over time.
@@ -17,7 +17,8 @@ The most prominent types of correlation are:
 Contrary to one-dimensional correlation, spatial correlation is multi-dimensional and multi-directional, and can be measured by indicators (e.g.,~\emph{Moran's I}~\cite{moran1950notes}) that reflect the \emph{spatial association} of the concerned data.
 Spatial autocorrelation has its foundations in the \emph{First Law of Geography} stating that ``everything is related to everything else, but near things are more related than distant things''~\cite{tobler1970computer}.
 A positive spatial autocorrelation indicates that similar data are \emph{clustered}, a negative that data are dispersed and are close to dissimilar ones, and when close to zero, that data are \emph{randomly arranged} in space.
-\kat{I still do not like this focus on spatial correlation.. maybe remove it totally? we only consider temporal correlation in the main work in any case.}
+% \kat{I still do not like this focus on spatial correlation.. maybe remove it totally? we only consider temporal correlation in the main work in any case.}
+% \mk{We consider it in general nonetheless, so we cannot ignore it}
 
 \subsection{Extraction of correlation}
 \label{subsec:cor-ext}
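The Moran's I indicator mentioned in the hunk above can be sketched in a few lines of plain Python. This is a hedged illustration on a hypothetical four-site example (not taken from the diffed text): positive values indicate that similar values cluster, negative values that they are dispersed.

```python
def morans_i(values, weights):
    """Moran's I for `values` under a symmetric spatial weight matrix
    `weights` (list of lists); weights[i][j] > 0 iff sites i and j are
    neighbours."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)

# Hypothetical example: four sites on a line, each adjacent to its neighbours.
adj = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 0]]

print(morans_i([1, 1, 5, 5], adj))  # clustered values: positive (~0.33)
print(morans_i([1, 5, 1, 5], adj))  # dispersed values: negative (-1.0)
```

For real analyses one would normally use a library implementation (e.g. PySAL's `esda.Moran`) rather than this sketch.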
@@ -47,7 +48,7 @@ Correlation appears in dependent data:
 
 \begin{itemize}
 \item within one data set, and
-\item among one data set and previous data releases, and/or other external sources~\cite{kifer2011no, chen2014correlated, liu2016dependence, zhao2017dependent}.
+\item among one data set and previous data releases, and/or other external sources~\cite{kifer2011no, chen2014correlated, liu2016dependence, zhao2017dependent}.
 \end{itemize}
 
 In the former case, data tuples and data values within a data set may be correlated, or linked in such a way that information about one person can be inferred even if the person is absent from the database.
@@ -80,11 +81,11 @@ The backward/forward privacy loss at any timestamp depends on the backward/forwa
 \begin{definition}
 [Temporal privacy loss (TPL)]
 \label{def:tpl}
-The potential privacy loss of a privacy mechanism at a timestamp $t \leq T$ due to a series of outputs $\pmb{o}_1$, \dots, $\pmb{o}_T$ and temporal correlation in its input $D_t$ with respect to any adversary, targeting an individual with potential data items $x_t$ (or $x'_t$) and having knowledge $\mathbb{D}_t$ equal to $D_t - \{x_t\}$ (or $D'_t - \{x'_t\}$), is defined as:
+The potential privacy loss of a privacy mechanism at a timestamp $t \in T$ due to a series of outputs $(\pmb{o}_i)_{i \in T}$ and temporal correlation in its input $D_t$ with respect to any adversary, targeting an individual with potential data items $x_t$ (or $x'_t$) and having knowledge $\mathbb{D}_t$ equal to $D_t - \{x_t\}$ (or $D'_t - \{x'_t\}$), is defined as:
 
 \begin{equation}
 \label{eq:tpl}
-\alpha_t = \sup_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_T} \ln \frac{\Pr[\pmb{o}_1, \dots, \pmb{o}_T | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_1, \dots, \pmb{o}_T | x'_t, \mathbb{D}_t]}
+\alpha_t = \sup_{x_t, x'_t, (\pmb{o}_i)_{i \in T}} \ln \frac{\Pr[(\pmb{o}_i)_{i \in T} | x_t, \mathbb{D}_t]}{\Pr[(\pmb{o}_i)_{i \in T} | x'_t, \mathbb{D}_t]}
 \end{equation}
 \end{definition}
 %
@@ -92,79 +93,79 @@ By analyzing Equation~\ref{eq:tpl} we get the following:
 
 \begin{align}
 \label{eq:tpl-1}
-(\ref{eq:tpl}) = & \underbrace{\sup_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_1, \dots, \pmb{o}_t| x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_1, \dots, \pmb{o}_t | x'_t, \mathbb{D}_t]}}_{\text{Backward privacy loss}\ (\alpha^B_t)} \nonumber \\
-& + \underbrace{\sup_{x_t, x'_t, \pmb{o}_t, \dots, \pmb{o}_T} \ln \frac{\Pr[\pmb{o}_t, \dots, \pmb{o}_T| x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t, \dots, \pmb{o}_T | x'_t, \mathbb{D}_t]}}_{\text{Forward privacy loss}\ (\alpha^F_t)} \nonumber \\
+(\ref{eq:tpl}) = & \underbrace{\sup_{x_t, x'_t, (\pmb{o}_i)_{i \in [\min(T), t]}} \ln \frac{\Pr[(\pmb{o}_i)_{i \in [\min(T), t]}| x_t, \mathbb{D}_t]}{\Pr[(\pmb{o}_i)_{i \in [\min(T), t]} | x'_t, \mathbb{D}_t]}}_{\text{Backward privacy loss}\ (\alpha^B_t)} \nonumber \\
+& + \underbrace{\sup_{x_t, x'_t, (\pmb{o}_i)_{i \in [t, \max(T)]}} \ln \frac{\Pr[(\pmb{o}_i)_{i \in [t, \max(T)]}| x_t, \mathbb{D}_t]}{\Pr[(\pmb{o}_i)_{i \in [t, \max(T)]} | x'_t, \mathbb{D}_t]}}_{\text{Forward privacy loss}\ (\alpha^F_t)} \nonumber \\
 & - \underbrace{\sup_{x_t, x'_t, \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t] }{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}}_{\text{Present privacy loss}\ (\varepsilon_t)}
 \end{align}
 
 \begin{definition}
 [Backward privacy loss (BPL)]
 \label{def:bpl}
-The potential privacy loss of a privacy mechanism at a timestamp $t \leq T$ due to outputs $\pmb{o}_1$, \dots, $\pmb{o}_t$ and temporal correlation in its input $D_t$ with respect to any adversary, targeting an individual with potential data items $x_t$ (or $x'_t$) and having knowledge $\mathbb{D}_t$ equal to $D_t - \{x_t\}$ (or $D'_t - \{x'_t\}$), is called backward privacy loss and is defined as:
+The potential privacy loss of a privacy mechanism at a timestamp $t \in T$ due to outputs $(\pmb{o}_i)_{i \in [\min(T), t]}$ and temporal correlation in its input $D_t$ with respect to any adversary, targeting an individual with potential data items $x_t$ (or $x'_t$) and having knowledge $\mathbb{D}_t$ equal to $D_t - \{x_t\}$ (or $D'_t - \{x'_t\}$), is called backward privacy loss and is defined as:
 
 \begin{equation}
 \label{eq:bpl-1}
-\alpha^B_t = \sup_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_1, \dots, \pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_1, \dots, \pmb{o}_t | x'_t, \mathbb{D}_t]}
+\alpha^B_t = \sup_{x_t, x'_t, (\pmb{o}_i)_{i \in [\min(T), t]}} \ln \frac{\Pr[(\pmb{o}_i)_{i \in [\min(T), t]} | x_t, \mathbb{D}_t]}{\Pr[(\pmb{o}_i)_{i \in [\min(T), t]} | x'_t, \mathbb{D}_t]}
 \end{equation}
 
 \end{definition}
 %
-From differential privacy we have the assumption that $\pmb{o}_1$, \dots, $\pmb{o}_t$ are independent events.
+From differential privacy we have the assumption that $(\pmb{o}_i)_{i \in [\min(T), t]}$ are independent events.
 Therefore, according to the Bayesian theorem, we can write Equation~\ref{eq:bpl-1} as:
 
 \begin{align}
 \label{eq:bpl-2}
-(\ref{eq:bpl-1}) = & \sup_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1}| x_t, \mathbb{D}_t] \Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x'_t, \mathbb{D}_t] \Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]} \nonumber \\
-= & \sup_{x_t, x_t', \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \frac{\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_t', \mathbb{D}_t]} \nonumber \\
+(\ref{eq:bpl-1}) = & \sup_{x_t, x'_t, (\pmb{o}_i)_{i \in [\min(T), t]}} \ln \frac{\Pr[(\pmb{o}_i)_{i \in [\min(T), t - 1]}| x_t, \mathbb{D}_t] \Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[(\pmb{o}_i)_{i \in [\min(T), t - 1]} | x'_t, \mathbb{D}_t] \Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]} \nonumber \\
+= & \sup_{x_t, x_t', (\pmb{o}_i)_{i \in [\min(T), t - 1]}} \ln \frac{\Pr[(\pmb{o}_i)_{i \in [\min(T), t - 1]} | x_t, \mathbb{D}_t]}{\Pr[(\pmb{o}_i)_{i \in [\min(T), t - 1]} | x_t', \mathbb{D}_t]} \nonumber \\
 & + \sup_{x_t, x_t', \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}
 \end{align}
 %
-Applying the law of total probability to the first term of Equation~\ref{eq:bpl-2} for all the possible data $x_{t - 1}$ (or $x'_{t - 1}$) and $\mathbb{D}_{t - 1}$ we get the following:
+Applying the law of total probability to the first term of Equation~\ref{eq:bpl-2} for all the possible data $x_{t - 1}$ (or $x'_{t - 1}$) and $\mathbb{D}_{t - 1}$ we get the following:
 
 \begin{align}
 \label{eq:bpl-3}
 (\ref{eq:bpl-2}) = &
 \adjustbox{max width=0.9\linewidth}{
-$\sup\limits_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \frac{\sum\limits_{x_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_t, \mathbb{D}_t, x_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x_{t - 1}, \mathbb{D}_{t - 1} | x_t, \mathbb{D}_t]}{\sum\limits_{x'_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x'_t, \mathbb{D}_t, x'_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x'_{t - 1}, \mathbb{D}_{t - 1} | x'_t, \mathbb{D}_t]}$
+$\sup\limits_{x_t, x'_t, (\pmb{o}_i)_{i \in [\min(T), t - 1]}} \ln \frac{\sum\limits_{x_{t - 1}} \Pr[(\pmb{o}_i)_{i \in [\min(T), t - 1]} | x_t, \mathbb{D}_t, x_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x_{t - 1}, \mathbb{D}_{t - 1} | x_t, \mathbb{D}_t]}{\sum\limits_{x'_{t - 1}} \Pr[(\pmb{o}_i)_{i \in [\min(T), t - 1]} | x'_t, \mathbb{D}_t, x'_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x'_{t - 1}, \mathbb{D}_{t - 1} | x'_t, \mathbb{D}_t]}$
 } \nonumber \\
 & \adjustbox{max width=0.3\linewidth}{
 $+ \sup\limits_{x_t, x_t', \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}$
 }
 \end{align}
 %
-Since $\mathbb{D}_t$ is equal to $D_t - \{x_t\}$ (or $D'_t - \{x'_t\}$), and thus is constant and independent of every possible $x_t$ (or $x'_t$), $\forall t \leq T$, Equation~\ref{eq:bpl-3} can be written as:
+Since $\mathbb{D}_t$ is equal to $D_t - \{x_t\}$ (or $D'_t - \{x'_t\}$), and thus is constant and independent of every possible $x_t$ (or $x'_t$), $\forall t \in T$, Equation~\ref{eq:bpl-3} can be written as:
 
 \begin{align}
 \label{eq:bpl-4}
 (\ref{eq:bpl-3}) = &
 \adjustbox{max width=0.9\linewidth}{
-$\sup\limits_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \frac{\sum\limits_{x_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_t, \mathbb{D}_t, x_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x_{t - 1} | x_t, \mathbb{D}_t] \Pr[\mathbb{D}_{t - 1} | x_t, \mathbb{D}_t]}{\sum\limits_{x'_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x'_t, \mathbb{D}_t, x'_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x'_{t - 1} | x'_t, \mathbb{D}_t] \Pr[\mathbb{D}_{t - 1} | x'_t, \mathbb{D}_t]}$
+$\sup\limits_{x_t, x'_t, (\pmb{o}_i)_{i \in [\min(T), t - 1]}} \ln \frac{\sum\limits_{x_{t - 1}} \Pr[(\pmb{o}_i)_{i \in [\min(T), t - 1]} | x_t, \mathbb{D}_t, x_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x_{t - 1} | x_t, \mathbb{D}_t] \Pr[\mathbb{D}_{t - 1} | x_t, \mathbb{D}_t]}{\sum\limits_{x'_{t - 1}} \Pr[(\pmb{o}_i)_{i \in [\min(T), t - 1]} | x'_t, \mathbb{D}_t, x'_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x'_{t - 1} | x'_t, \mathbb{D}_t] \Pr[\mathbb{D}_{t - 1} | x'_t, \mathbb{D}_t]}$
 } \nonumber \\
 & \adjustbox{max width=0.275\linewidth}{
 $+ \sup\limits_{x_t, x_t', \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}$
 } \nonumber \\
 = & \adjustbox{max width=0.825\linewidth}{
-$\sup\limits_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \frac{\sum\limits_{x_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_t, \mathbb{D}_t, x_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x_{t - 1} | x_t] \Pr[\mathbb{D}_{t - 1} | \mathbb{D}_t]}{\sum\limits_{x'_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x'_t, \mathbb{D}_t, x'_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x'_{t - 1} | x'_t] \Pr[\mathbb{D}_{t - 1} | \mathbb{D}_t]}$
+$\sup\limits_{x_t, x'_t, (\pmb{o}_i)_{i \in [\min(T), t - 1]}} \ln \frac{\sum\limits_{x_{t - 1}} \Pr[(\pmb{o}_i)_{i \in [\min(T), t - 1]} | x_t, \mathbb{D}_t, x_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x_{t - 1} | x_t] \Pr[\mathbb{D}_{t - 1} | \mathbb{D}_t]}{\sum\limits_{x'_{t - 1}} \Pr[(\pmb{o}_i)_{i \in [\min(T), t - 1]} | x'_t, \mathbb{D}_t, x'_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x'_{t - 1} | x'_t] \Pr[\mathbb{D}_{t - 1} | \mathbb{D}_t]}$
 } \nonumber \\
 & \adjustbox{max width=0.275\linewidth}{
 $+ \sup\limits_{x_t, x_t', \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}$
 } \nonumber \\
 = & \adjustbox{max width=0.7\linewidth}{
-$\sup\limits_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \frac{\sum\limits_{x_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_t, \mathbb{D}_t, x_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x_{t - 1} | x_t]}{\sum\limits_{x'_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x'_t, \mathbb{D}_t, x'_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x'_{t - 1} | x'_t]}$
+$\sup\limits_{x_t, x'_t, (\pmb{o}_i)_{i \in [\min(T), t - 1]}} \ln \frac{\sum\limits_{x_{t - 1}} \Pr[(\pmb{o}_i)_{i \in [\min(T), t - 1]} | x_t, \mathbb{D}_t, x_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x_{t - 1} | x_t]}{\sum\limits_{x'_{t - 1}} \Pr[(\pmb{o}_i)_{i \in [\min(T), t - 1]} | x'_t, \mathbb{D}_t, x'_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x'_{t - 1} | x'_t]}$
 } \nonumber \\
 & \adjustbox{max width=0.275\linewidth}{
 $+ \sup\limits_{x_t, x_t', \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}$
 }
 \end{align}
 %
-The outputs $\pmb{o}_1$, \dots, $\pmb{o}_t$ and $x_t$ (or $x'_t$) are conditionally independent in the presence of
+The outputs $(\pmb{o}_i)_{i \in [\min(T), t]}$ and $x_t$ (or $x'_t$) are conditionally independent in the presence of
 $x_{t - 1}$ (or $x'_{t - 1}$), and thus Equation~\ref{eq:bpl-4} can be written as:
 
 \begin{align}
 \label{eq:bpl-5}
 (\ref{eq:bpl-4}) = &
 \adjustbox{max width=0.9\linewidth}{
-$\sup\limits_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \cfrac{\sum\limits_{x_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x_{t - 1} | x_t]}{\sum\limits_{x'_{t - 1}} \underbrace{\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x'_{t - 1}, \mathbb{D}_{t - 1}]}_{\alpha^B_{t - 1}} \underbrace{\Pr[x'_{t - 1} | x'_t]}_{P^B_{t - 1}}}$
+$\sup\limits_{x_t, x'_t, (\pmb{o}_i)_{i \in [\min(T), t - 1]}} \ln \cfrac{\sum\limits_{x_{t - 1}} \Pr[(\pmb{o}_i)_{i \in [\min(T), t - 1]} | x_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x_{t - 1} | x_t]}{\sum\limits_{x'_{t - 1}} \underbrace{\Pr[(\pmb{o}_i)_{i \in [\min(T), t - 1]} | x'_{t - 1}, \mathbb{D}_{t - 1}]}_{\alpha^B_{t - 1}} \underbrace{\Pr[x'_{t - 1} | x'_t]}_{P^B_{t - 1}}}$
 } \nonumber \\
 & \adjustbox{max width=0.4\linewidth}{
 $+ \underbrace{\sup\limits_{x_t, x'_t, \pmb{o}_t} \ln \cfrac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}}_{\varepsilon_t}$
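The recurrence in the equation above ($\alpha^B_t = \varepsilon_t$ plus a carry-over term driven by the backward correlation $P^B_{t-1}$) can be evaluated numerically. The sketch below is a simplified, assumption-laden illustration rather than the algorithm of the diffed text: it models $P^B$ as a row-stochastic matrix `P[x_t][x_prev] = Pr[x_prev | x_t]`, bounds the previous-step output likelihoods to a relative range $[1, e^{\alpha^B_{t-1}}]$, and exploits the fact that the resulting linear-fractional supremum is attained at the endpoints of that range, so it simply enumerates them.

```python
from itertools import product
from math import exp, log

def bpl_step(eps, alpha_prev, P):
    """One step of the backward-privacy-loss recurrence.

    P is row-stochastic: P[x_t][x_prev] = Pr[x_prev | x_t] (the backward
    correlation).  Each previous-step likelihood m_j is confined, up to a
    common scale, to [1, e^{alpha_prev}]; the sup of the ratio of the two
    weighted sums is reached at endpoint assignments, which we enumerate."""
    k = len(P)
    hi = exp(alpha_prev)
    best = 0.0
    for m in product((1.0, hi), repeat=k):
        for a in range(k):
            for b in range(k):
                num = sum(m[j] * P[a][j] for j in range(k))
                den = sum(m[j] * P[b][j] for j in range(k))
                best = max(best, log(num / den))
    return eps + best

def bpl(eps, P, T):
    """Backward privacy loss after T releases with per-step budget eps."""
    alpha = 0.0
    for _ in range(T):
        alpha = bpl_step(eps, alpha, P)
    return alpha

ident = [[1.0, 0.0], [0.0, 1.0]]  # perfect backward correlation
unif  = [[0.5, 0.5], [0.5, 0.5]]  # no correlation
print(bpl(0.1, ident, 3))  # ~0.3: the loss accumulates over timestamps
print(bpl(0.1, unif, 3))   # 0.1: no extra loss beyond the per-step budget
```

The two hypothetical transition matrices recover the expected extremes: under perfect correlation the backward loss accumulates linearly, while without correlation it stays at the per-step $\varepsilon$.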
@@ -174,11 +175,11 @@ $x_{t - 1}$ (or $x'_{t - 1}$), and thus Equation~\ref{eq:bpl-4} can be written a
 \begin{definition}
 [Forward privacy loss (FPL)]
 \label{def:fpl}
-The potential privacy loss of a privacy mechanism at a timestamp $t \leq T$ due to outputs $\pmb{o}_t$,\dots,$\pmb{o}_T$ and temporal correlation in its input $D_t$ with respect to any adversary, targeting an individual with potential data item $x_t$ (or $x'_t$) and having knowledge $\mathbb{D}_t$ equal to $D_t - \{x_t\}$ (or $D'_t - \{x'_t\}$), is called forward privacy loss and is defined as:
+The potential privacy loss of a privacy mechanism at a timestamp $t \in T$ due to outputs $(\pmb{o}_i)_{i \in [t, \max(T)]}$ and temporal correlation in its input $D_t$ with respect to any adversary, targeting an individual with potential data item $x_t$ (or $x'_t$) and having knowledge $\mathbb{D}_t$ equal to $D_t - \{x_t\}$ (or $D'_t - \{x'_t\}$), is called forward privacy loss and is defined as:
 
 \begin{equation}
 \label{eq:fpl-1}
-\alpha^F_t = \sup_{x_t, x'_t, \pmb{o}_t, \dots, \pmb{o}_T} \ln \frac{\Pr[\pmb{o}_t, \dots, \pmb{o}_T | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t, \dots, \pmb{o}_T | x'_t, \mathbb{D}_t]}
+\alpha^F_t = \sup_{x_t, x'_t, (\pmb{o}_i)_{i \in [t, \max(T)]}} \ln \frac{\Pr[(\pmb{o}_i)_{i \in [t, \max(T)]} | x_t, \mathbb{D}_t]}{\Pr[(\pmb{o}_i)_{i \in [t, \max(T)]} | x'_t, \mathbb{D}_t]}
 \end{equation}
 \end{definition}
 %
@@ -188,7 +189,7 @@ Similar to the way that we concluded to Equation~\ref{eq:bpl-5} from Equation~\r
 \label{eq:fpl-2}
 (\ref{eq:fpl-1}) = &
 \adjustbox{max width=0.9\linewidth}{
-$\sup\limits_{x_t, x'_t, \pmb{o}_{t + 1}, \dots, \pmb{o}_T} \ln \cfrac{\sum\limits_{x_{t + 1}} \Pr[\pmb{o}_{t + 1}, \dots, \pmb{o}_T | x_{t + 1}, \mathbb{D}_{t + 1}] \Pr[x_{t + 1} | x_t]}{\sum\limits_{x'_{t + 1}} \underbrace{\Pr[\pmb{o}_{t + 1}, \dots, \pmb{o}_T | x'_{t + 1}, \mathbb{D}_{t + 1}]}_{\alpha^F_{t + 1}} \underbrace{\Pr[x'_{t + 1} | x'_t]}_{P^F_{t + 1}}}$
+$\sup\limits_{x_t, x'_t, (\pmb{o}_i)_{i \in [t + 1, \max(T)]}} \ln \cfrac{\sum\limits_{x_{t + 1}} \Pr[(\pmb{o}_i)_{i \in [t + 1, \max(T)]} | x_{t + 1}, \mathbb{D}_{t + 1}] \Pr[x_{t + 1} | x_t]}{\sum\limits_{x'_{t + 1}} \underbrace{\Pr[(\pmb{o}_i)_{i \in [t + 1, \max(T)]} | x'_{t + 1}, \mathbb{D}_{t + 1}]}_{\alpha^F_{t + 1}} \underbrace{\Pr[x'_{t + 1} | x'_t]}_{P^F_{t + 1}}}$
 } \nonumber \\
 & \adjustbox{max width=0.4\linewidth}{
 $+ \underbrace{\sup\limits_{x_t, x'_t, \pmb{o}_t} \ln \cfrac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}}_{\varepsilon_t}$
@@ -201,8 +202,8 @@ Therefore, we calculate the extra privacy loss under temporal correlation, due t
 More specifically, the calculation of TPL (Equation~\ref{eq:tpl-1}) becomes:
 \begin{align}
 \label{eq:tpl-local}
-& \underbrace{\sup_{D_t, D'_t, \pmb{o}_1, \dots, \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_1, \dots, \pmb{o}_t| D_t]}{\Pr[\pmb{o}_1, \dots, \pmb{o}_t | D'_t]}}_{\text{Backward privacy loss}\ (\alpha^B_t)} \nonumber \\
-& + \underbrace{\sup_{D_t, D'_t, \pmb{o}_t, \dots, \pmb{o}_T} \ln \frac{\Pr[\pmb{o}_t, \dots, \pmb{o}_T| D_t]}{\Pr[\pmb{o}_t, \dots, \pmb{o}_T | D'_t]}}_{\text{Forward privacy loss}\ (\alpha^F_t)} \nonumber \\
+& \underbrace{\sup_{D_t, D'_t, (\pmb{o}_i)_{i \in [\min(T), t]}} \ln \frac{\Pr[(\pmb{o}_i)_{i \in [\min(T), t]}| D_t]}{\Pr[(\pmb{o}_i)_{i \in [\min(T), t]} | D'_t]}}_{\text{Backward privacy loss}\ (\alpha^B_t)} \nonumber \\
+& + \underbrace{\sup_{D_t, D'_t, (\pmb{o}_i)_{i \in [t, \max(T)]}} \ln \frac{\Pr[(\pmb{o}_i)_{i \in [t, \max(T)]}| D_t]}{\Pr[(\pmb{o}_i)_{i \in [t, \max(T)]} | D'_t]}}_{\text{Forward privacy loss}\ (\alpha^F_t)} \nonumber \\
 & - \underbrace{\sup_{D_t, D'_t, \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | D_t]}{\Pr[\pmb{o}_t | D'_t]}}_{\text{Present privacy loss}\ (\varepsilon_t)}
 \end{align}
 %
@@ -210,7 +211,7 @@ The calculation of BPL (Equation~\ref{eq:bpl-5}) becomes:
 \begin{align}
 \label{eq:bpl-local}
 & \adjustbox{max width=0.9\linewidth}{
-$\sup\limits_{D_t, D'_t, \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \cfrac{\sum\limits_{D_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | D_{t - 1}] \Pr[D_{t - 1} | D_t]}{\sum\limits_{D'_{t - 1}} \underbrace{\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | D'_{t - 1}]}_{\alpha^B_{t - 1}} \underbrace{\Pr[D'_{t - 1} | D'_t]}_{P^B_{t - 1}}}$
+$\sup\limits_{D_t, D'_t, (\pmb{o}_i)_{i \in [\min(T), t - 1]}} \ln \cfrac{\sum\limits_{D_{t - 1}} \Pr[(\pmb{o}_i)_{i \in [\min(T), t - 1]} | D_{t - 1}] \Pr[D_{t - 1} | D_t]}{\sum\limits_{D'_{t - 1}} \underbrace{\Pr[(\pmb{o}_i)_{i \in [\min(T), t - 1]} | D'_{t - 1}]}_{\alpha^B_{t - 1}} \underbrace{\Pr[D'_{t - 1} | D'_t]}_{P^B_{t - 1}}}$
 } \nonumber \\
 & \adjustbox{max width=0.4\linewidth}{
 $+ \underbrace{\sup\limits_{D_t, D'_t, \pmb{o}_t} \ln \cfrac{\Pr[\pmb{o}_t | D_t]}{\Pr[\pmb{o}_t | D'_t]}}_{\varepsilon_t}$
@@ -221,7 +222,7 @@ The calculation of FPL (Equation~\ref{eq:fpl-2}) becomes:
 \begin{align}
 \label{eq:fpl-local}
 & \adjustbox{max width=0.9\linewidth}{
-$\sup\limits_{D_t, D'_t, \pmb{o}_{t + 1}, \dots, \pmb{o}_T} \ln \cfrac{\sum\limits_{D_{t + 1}} \Pr[\pmb{o}_{t + 1}, \dots, \pmb{o}_T | D_{t + 1}] \Pr[D_{t + 1} | D_t]}{\sum\limits_{D'_{t + 1}} \underbrace{\Pr[\pmb{o}_{t + 1}, \dots, \pmb{o}_T | D'_{t + 1}]}_{\alpha^F_{t + 1}} \underbrace{\Pr[D'_{t + 1} | D'_t]}_{P^F_{t + 1}}}$
+$\sup\limits_{D_t, D'_t, (\pmb{o}_i)_{i \in [t + 1, \max(T)]}} \ln \cfrac{\sum\limits_{D_{t + 1}} \Pr[(\pmb{o}_i)_{i \in [t + 1, \max(T)]} | D_{t + 1}] \Pr[D_{t + 1} | D_t]}{\sum\limits_{D'_{t + 1}} \underbrace{\Pr[(\pmb{o}_i)_{i \in [t + 1, \max(T)]} | D'_{t + 1}]}_{\alpha^F_{t + 1}} \underbrace{\Pr[D'_{t + 1} | D'_t]}_{P^F_{t + 1}}}$
 } \nonumber \\
 & \adjustbox{max width=0.4\linewidth}{
 $+ \underbrace{\sup\limits_{D_t, D'_t, \pmb{o}_t} \ln \cfrac{\Pr[\pmb{o}_t | D_t]}{\Pr[\pmb{o}_t | D'_t]}}_{\varepsilon_t}$

@@ -20,7 +20,7 @@ is known as \emph{information disclosure} and is usually categorized as (\cite{l
 In the literature, identity disclosure is also referred to as \emph{record linkage}, and presence disclosure as \emph{table linkage}.
 Notice that identity disclosure can result in attribute disclosure, and vice versa.
 
-To better illustrate these definitions, we provide some examples based on Table~\ref{tab:snapshot}.
+To better illustrate these definitions, we provide some examples based on Figure~\ref{fig:snapshot}.
 Presence disclosure appears when by looking at the (privacy-protected) counts of Table~\ref{tab:snapshot-statistical}, we can guess if Quackmore has participated in Table~\ref{tab:snapshot-micro}.
 Identity disclosure appears when we can guess that the sixth record of (a privacy-protected version of) the microdata of Table~\ref{tab:snapshot-micro} belongs to Quackmore.
 Attribute disclosure appears when it is revealed from (a privacy-protected version of) the microdata of Table~\ref{tab:snapshot-micro} that Quackmore is $62$ years old.
@@ -72,29 +72,30 @@ In order to better protect the privacy of Donald in case of attacks, the data sh
 % More specifically, i
 In continuous data publishing we consider the privacy protection level with respect to not only the users, but also to the \emph{events} occurring in the data.
 An event is a pair of an identifying attribute of an individual and the sensitive data (including contextual information) and we can see it as a correspondence to a record in a database, where each individual may participate once.
-Data publishers typically release events in the form of sequences of data items, usually indexed in time order (time series) and geotagged, e.g.,~(`Dewey', `at home at Montmartre at $t_1$'), \dots, (`Quackmore', `dining at Opera at $t_1$').
+Data publishers typically release events in the form of sequences of data items, usually indexed in time order (time series) and geotagged, e.g.,~(`Dewey', `at home at Montmartre at $t_1$'), \dots, (`Quackmore', `dining at Opéra at $t_1$').
 We use the term `users' to refer to the \emph{individuals}, also known as \emph{participants}, who are the source of the processed and published data.
 Therefore, they should not be confused with the consumers of the released data sets.
 Users are subject to privacy attacks, and thus are the main point of interest of privacy protection mechanisms.
-In more detail, the privacy protection levels are:
+The possible privacy protection levels are the \emph{event}~\cite{dwork2010differential, dwork2010pan}, \emph{user}~\cite{dwork2010differential, dwork2010pan}, and \emph{$w$-event}~\cite{kellaris2014differentially}.
 
 \begin{enumerate}[(a)]
-\item \emph{Event}~\cite{dwork2010differential, dwork2010pan}---limits the privacy protection to \emph{any single event} in a time series, providing high
+\item \emph{Event-level}~\cite{dwork2010differential, dwork2010pan} limits the privacy protection to \emph{any single event} in a time series, providing high
 % \kat{maximum? better say high}
 data utility.
-\item \emph{$w$-event}~\cite{kellaris2014differentially}---provides privacy protection to \emph{any sequence of $w$ events} in a time series.
-\item \emph{User}~\cite{dwork2010differential, dwork2010pan}---protects \emph{all the events} in a time series, providing high
+\item \emph{User-level}~\cite{dwork2010differential, dwork2010pan} protects \emph{all the events} in a time series, providing high
+\item \emph{$w$-event-level}~\cite{kellaris2014differentially} provides privacy protection to \emph{any sequence of $w$ events} in a time series.
 % \kat{maximum? better say high}
 privacy protection.
 \end{enumerate}
 
 Figure~\ref{fig:prv-levels} demonstrates the application of the possible protection levels on the statistical data of Example~\ref{ex:continuous}.
-For instance, in event-level (Figure~\ref{fig:level-event}) it is hard to determine whether Quackmore was dining at Opera at $t_1$.
+For instance, in event-level (Figure~\ref{fig:level-event}) it is hard to determine whether Quackmore was dining at Opéra at $t_1$.
 Moreover, in user-level (Figure~\ref{fig:level-user}) it is hard to determine whether Quackmore was ever included in the released series of events at all.
 Finally, in $2$-event-level (Figure~\ref{fig:level-w-event}) it is hard to determine whether Quackmore was ever included in the released series of events between the timestamps $t_1$ and $t_2$, $t_2$ and $t_3$, etc. (i.e.,~for a window $w = 2$).
-\kat{Already, by looking at the original counts, for the reader it is hard to see if Quackmore was in the event/database. So, we don't really get the difference among the different levels here.}
-\mk{It is without background knowledge.}
-\kat{But you discuss event and level here by showing just counts, with no background knowledge, and you want the reader to understand how in one case we are not sure if he participated in the event t1 or in any of the events. It is not clear to me what is the difference, just by looking at the example with the counts. }
+% \kat{Already, by looking at the original counts, for the reader it is hard to see if Quackmore was in the event/database. So, we don't really get the difference among the different levels here.}
+% \mk{It is without background knowledge.}
+% \kat{But you discuss event and level here by showing just counts, with no background knowledge, and you want the reader to understand how in one case we are not sure if he participated in the event t1 or in any of the events. It is not clear to me what is the difference, just by looking at the example with the counts. }
+% \mk{I'll check again later}
 
 \begin{figure}[htp]
 \centering
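One common way to make the event/user/$w$-event distinction concrete is the noise scale of a Laplace mechanism over a series of $T$ releases: protecting more events of the same individual forces proportionally larger noise (by sequential composition). The sketch below is a hedged illustration with hypothetical values ($\varepsilon$, $T$, $w$, and the count $42$ are made up, and the uniform budget split is only one possible allocation), not the setup of the diffed text.

```python
import math
import random

def laplace_noise(scale):
    """Sample zero-mean Laplace noise via inverse-CDF sampling."""
    u = random.random()
    while u == 0.0:           # avoid the log(0) edge case
        u = random.random()
    u -= 0.5
    return -math.copysign(scale * math.log(1.0 - 2.0 * abs(u)), u)

eps, sensitivity, T, w = 1.0, 1, 10, 2  # hypothetical budget and series length

# Event-level: each release only hides any single event.
event_scale = sensitivity / eps
# w-event-level: any window of w consecutive events is hidden.
w_event_scale = w * sensitivity / eps
# User-level: all T events of an individual are hidden (sequential composition).
user_scale = T * sensitivity / eps

noisy_count = 42 + laplace_noise(event_scale)  # one event-level noisy release
```

The design point is that user-level protection over a long series is the most expensive in utility: its noise scale grows linearly with $T$, whereas event-level noise is independent of the series length.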
@ -140,22 +141,25 @@ We identify the following privacy operations that can be applied on the original
|
||||
% \mk{``granularity''?}
|
||||
|
||||
\begin{itemize}
|
||||
\item \emph{Aggregation}---combine
|
||||
\item \emph{Aggregation} combines
|
||||
% group
|
||||
% \kat{or combine? also maybe mention that the single value will replace the values of a specific attribute of these rows}
|
||||
% together
|
||||
multiple rows of a data set to form a single value which will replace these rows.
|
||||
\item \emph{Generalization}---replace an attribute value with a parent value in the attribute taxonomy (when applicable).
|
||||
\item \emph{Generalization} replaces an attribute value with a parent value in the attribute taxonomy (when applicable).
|
||||
% Notice that a step of generalization, may be followed by a step of \emph{specialization}, to improve the quality of the resulting data set.
|
||||
% \kat{This technical detail is not totally clear at this point. Either elaborate or remove.}
|
||||
% \mk{I cannot remember coming across it in the literature.}
|
||||
\item \emph{Suppression}---delete completely certain sensitive values or entire records.
|
||||
\item \emph{Perturbation}---disturb the initial attribute value in a deterministic or probabilistic way.
|
||||
\item \emph{Suppression} deletes completely certain sensitive values or entire records.
|
||||
\item \emph{Perturbation} disturbs the initial attribute value in a deterministic or probabilistic way.
|
||||
The probabilistic data distortion is referred to as \emph{randomization}.
|
||||
\end{itemize}
|
||||
|
||||
For example, consider the table schema \emph{User(Name, Age, Location, Status)}.
If we want to protect the \emph{Age} of the user by aggregation, we may
% replace it by the average age in her Location\kat{This example does not follow the description you give before for aggregation. Indeed, it fits better the perturbation (you replaced the value with the average age of the same location, which is a deterministic process). Don't you mean counts by aggregation? If you mean aggregation as in sql functions then you should not say in the definition that you replace the rows with the aggregate, but a specific attribute's value. }
group the data by Location and report the average Age for each group;
by generalization, we may replace the Age by Age intervals; by suppression, we may delete the entire table column corresponding to Age; by perturbation, we may augment each Age by a predefined percentage of the Age; by randomization, we may randomly replace each Age by a value drawn from the probability density function of the attribute.
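To make the operations above concrete, the following Python sketch applies each of them to toy microdata following the hypothetical \emph{User} schema of the example; all names and values are ours, chosen for illustration only.

```python
import random
import statistics

# Toy microdata following the hypothetical User(Name, Age, Location, Status) schema.
users = [
    {"Name": "Ana",  "Age": 18, "Location": "Montmartre", "Status": "at home"},
    {"Name": "Bob",  "Age": 25, "Location": "Montmartre", "Status": "walking"},
    {"Name": "Carl", "Age": 32, "Location": "Belleville", "Status": "at work"},
]

def aggregate_age(rows):
    """Aggregation: group by Location and report the average Age per group."""
    groups = {}
    for r in rows:
        groups.setdefault(r["Location"], []).append(r["Age"])
    return {loc: statistics.mean(ages) for loc, ages in groups.items()}

def generalize_age(age):
    """Generalization: replace an exact Age with an interval."""
    return "<=20" if age <= 20 else ">20"

def suppress_age(rows):
    """Suppression: delete the entire Age column."""
    return [{k: v for k, v in r.items() if k != "Age"} for r in rows]

def perturb_age(age, pct=0.1):
    """Perturbation: deterministically augment Age by a predefined percentage."""
    return age * (1 + pct)

def randomize_age(age, candidates):
    """Randomization: replace Age with a random draw from the attribute's values."""
    return random.choice(candidates)
```

Each function realizes one protection operation from the list; a real anonymization pipeline would of course apply them systematically over all quasi-identifying attributes rather than Age alone.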

It is worth mentioning that there is a series of algorithms (e.g.,~\cite{benaloh2009patient, kamara2010cryptographic, cao2014privacy}) based on \emph{cryptographic} operations.
However, the majority of these methods assume, among other things, minimal or even no trust in the entities that handle the personal information.
Sweeney coined \emph{$k$-anonymity}~\cite{sweeney2002k}, one of the first established works on data privacy.
A released data set features $k$-anonymity protection when the values of a set of identifying attributes, called the \emph{quasi-identifiers}, are the same for at least $k$ records in the data set.
% Computing the quasi-identifiers in a set of attributes is still a hard problem on its own~\cite{motwani2007efficient}.
% \kat{yes indeed, but seems out of context here.}
% $k$-anonymity
% is syntactic,
% \kat{meaning?}
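The $k$-anonymity condition is easy to check mechanically: every combination of quasi-identifier values must occur in at least $k$ records. The following Python sketch, over hypothetical generalized records of our own choosing, does exactly that.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """True when every combination of quasi-identifier values
    appears in at least k records of the released data set."""
    combos = Counter(tuple(r[a] for a in quasi_identifiers) for r in rows)
    return all(count >= k for count in combos.values())

# Hypothetical released records with generalized quasi-identifiers:
# Age as an interval, Location at city level.
released = (
    [{"Age": "<=20", "Location": "Paris"}] * 3
    + [{"Age": ">20", "Location": "Paris"}] * 3
)
```

Here `released` is $3$-anonymous with respect to (Age, Location): each quasi-identifier combination appears exactly three times.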
\label{subsec:prv-statistical}

While methods based on $k$-anonymity have been mainly employed for releasing microdata, \emph{differential privacy}~\cite{dwork2006calibrating} has been proposed for releasing high-utility aggregates over microdata while providing semantic
% \kat{semantic ?}
% \mk{Yes, explainted by the following}
privacy guarantees that characterize the output data.
Differential privacy is algorithmic, i.e.,~it is a property of the data publishing process rather than of the released data set.
% \kat{algorithmic? moreover, you repeat this sentence later on, after the definition of neighboring datasets}
\begin{definition}
[Neighboring data sets]
Two data sets $D, D' \in \mathcal{D}$ are \emph{neighboring} when they differ by a single tuple, i.e.,~when one results from the addition/removal of a single tuple to/from the other.
\end{definition}

Moreover, differential privacy quantifies and bounds the impact that the addition/removal of an individual to/from a data set has on the derived privacy-protected aggregates thereof.
More precisely, differential privacy quantifies the impact of the addition/removal of a single tuple in $D$ on the output $\pmb{o}$ of a privacy mechanism $\mathcal{M}$ that perturbs the result of a query function $f$.
% \kat{what is M?}
The distribution of all $\pmb{o}$, in some range $\mathcal{O}$, is not affected \emph{substantially}, i.e.,~it changes only slightly due to the modification of any one tuple in all possible $D \in \mathcal{D}$.
Formally, differential privacy is given in Definition~\ref{def:dp}.
% ensures that any adversary observing any $\pmb{o}$ cannot conclude with absolute certainty whether or not any individual is included in any $D$.
% Its performance is irrelevant to the computational power and auxiliary information available to an adversary observing the outputs of $\mathcal{M}$.
% \kat{you already said this. Moreover, it is irrelevant to the neighboring datasets and thus does not fit here..}
% \kat{Say what is a mechanism and how it is connected to the query, what are their differences? In the next section that you speak about the examples, we are still not sure about what is a mechanism in general.}
\begin{definition}
[Differential privacy]
\label{def:dp}
A privacy mechanism $\mathcal{M}$, with range $\mathcal{O}$, satisfies $\varepsilon$-differential privacy, for a given privacy budget $\varepsilon \in \mathbb{R}^{+}$, if for all pairs of neighboring data sets $D, D' \in \mathcal{D}$, and all sets of possible outputs $O \subseteq \mathcal{O}$, it holds that:
$$\Pr[\mathcal{M}(D) \in O] \leq e^{\varepsilon} \Pr[\mathcal{M}(D') \in O]$$
\end{definition}
However, sum, max, and in some cases average
% \kat{and average }
queries can be problematic, since a single outlier value could change the output noticeably, making it necessary to add a lot of noise to the query's answer.

% \kat{introduce and link to the previous text the following definition }

\begin{definition}
[Sensitivity]
The \emph{sensitivity} of a query function $f$ is the maximum impact that any pair of neighboring data sets can have on the function's output:
$$\Delta f = \max_{D, D' \in \mathcal{D}} \lVert {f(D) - f(D')} \rVert_{1}$$
\end{definition}

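A small Python sketch makes the contrast concrete. Note that the true $\Delta f$ is a maximum over \emph{all} pairs of neighboring data sets in $\mathcal{D}$; the helper below only probes the neighbors of one concrete data set of our own choosing, which is enough to see why a count query is well behaved while a sum query is dominated by its largest value.

```python
def empirical_sensitivity(f, dataset):
    """Largest change in f(D) caused by removing any single tuple,
    i.e., the L1 distance between f on D and on its neighbors.
    This lower-bounds the true sensitivity, which ranges over all of D."""
    return max(abs(f(dataset) - f(dataset[:i] + dataset[i + 1:]))
               for i in range(len(dataset)))

ages = [21, 25, 32, 56]
count_change = empirical_sensitivity(len, ages)  # count query: always 1
sum_change = empirical_sensitivity(sum, ages)    # sum query: the outlier, 56
```

For counts, $\Delta f = 1$ no matter the data, so little noise is needed; for sums, the change equals the removed value itself, so a single outlier forces a large noise scale.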
The notion of differential privacy has highly influenced the research community, resulting in many follow-up publications (\cite{mcsherry2007mechanism, kifer2011no, zhang2017privbayes} to mention a few).
We distinguish here \emph{Pufferfish}~\cite{kifer2014pufferfish}.
\emph{Pufferfish} is a framework that allows experts in an application domain, without necessarily having any particular expertise in privacy, to develop privacy definitions for their data sharing needs.
To define a privacy mechanism using \emph{Pufferfish}, one has to define a set of potential secrets $\mathcal{X}$, a set of distinct pairs $\mathcal{X}_{pairs}$, and auxiliary information about data evolution scenarios $\mathcal{B}$.
$\mathcal{X}$ serves as an explicit specification of what we would like to protect, e.g.,~`the record of an individual $x$ is (not) in the data'.
$\mathcal{X}_{pairs}$ is a subset of $\mathcal{X} \times \mathcal{X}$ that instructs how to protect the potential secrets $\mathcal{X}$, e.g.,~(`$x$ is in the table', `$x$ is not in the table').
Finally, $\mathcal{B}$ is a set of conservative assumptions about how the data evolved (or were generated) that reflects the adversary's belief about the data, e.g.,~probability distributions, variable correlations, etc.
When there is independence between all the records in the original data set, then $\varepsilon$-differential privacy and the privacy definition of $\varepsilon$-\emph{Pufferfish}$(\mathcal{X}, \mathcal{X}_{pairs}, \mathcal{B})$ are equivalent.

\paragraph{Popular privacy mechanisms}
\label{subsec:prv-mech}

A typical example of a differential privacy mechanism is the \emph{Laplace mechanism}.
It draws randomly a value from the probability distribution of $\textrm{Laplace}(\mu, b)$, where $\mu$ stands for the location parameter and $b > 0$ is the scale parameter (Figure~\ref{fig:laplace}).
In our case, $\mu$ is equal to the original output value of a query function, and $b$ is the sensitivity of the query function divided by $\varepsilon$.
The Laplace mechanism works for any function with range the set of real numbers.

\begin{figure}[htp]
\centering
\label{fig:laplace}
\end{figure}

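A minimal sketch of the Laplace mechanism in Python, using inverse-CDF sampling so that only the standard library is needed (users of numpy can call `numpy.random.laplace(true_answer, b)` instead):

```python
import math
import random

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Return the query answer plus noise drawn from
    Laplace(mu=true_answer, b=sensitivity/epsilon)."""
    b = sensitivity / epsilon
    # Inverse CDF of the Laplace distribution: for u uniform in
    # (-0.5, 0.5), X = mu - b * sign(u) * ln(1 - 2|u|).
    u = random.random() - 0.5
    return true_answer - b * math.copysign(1, u) * math.log(1 - 2 * abs(u))

# E.g., a count query (sensitivity 1) answered with budget epsilon = 1.
noisy_count = laplace_mechanism(42, 1, 1.0)
```

Averaged over many runs the noise cancels out, which is why repeated queries are dangerous: an adversary could average away the protection, a point the composition theorems below formalize.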
A specialization of this mechanism for location data is the \emph{Planar Laplace mechanism}~\cite{andres2013geo,chatzikokolakis2015geo},
% which is based on a multivariate Laplace distribution.
% \emph{Geo-indistinguishability} is
an adaptation of differential privacy for location data in snapshot publishing.
It is based on $l$-privacy, which offers to individuals within an area with radius $r$, a privacy level of $l$.
More specifically, $l$ is equal to $\varepsilon r$ if any two locations within distance $r$ provide data with similar distributions.
This similarity depends on $r$ because the closer two locations are, the more likely they are to share the same features.
Intuitively, the definition implies that if an adversary learns the published location for an individual, the adversary cannot infer the individual's true location, out of all the points in a radius $r$, with a certainty higher than a factor depending on $l$.
The technique adds random noise drawn from a multivariate Laplace distribution to individuals' locations, while taking into account spatial boundaries and features.

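The core noise-drawing step can be sketched as follows. In polar coordinates the planar Laplace density $\propto e^{-\varepsilon r}$ factors into a uniform angle and a radius whose distribution is $\mathrm{Gamma}(2, 1/\varepsilon)$; the sketch stops there and deliberately omits the handling of spatial boundaries and the discretization/remapping steps of the full mechanism.

```python
import math
import random

def planar_laplace_noise(epsilon):
    """Sample 2D noise with density proportional to exp(-epsilon * ||p||):
    the radius follows Gamma(shape=2, scale=1/epsilon), the angle is uniform."""
    r = random.gammavariate(2, 1 / epsilon)
    theta = random.uniform(0, 2 * math.pi)
    return r * math.cos(theta), r * math.sin(theta)

def geo_perturb(x, y, epsilon):
    """Report a perturbed version of the true location (x, y)."""
    dx, dy = planar_laplace_noise(epsilon)
    return x + dx, y + dy
```

The expected distance between the true and the reported location is $2/\varepsilon$, so the budget directly trades utility (spatial accuracy) for privacy.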
For query functions that do not return a real number, e.g.,~`What is the most visited country this year?', or in cases where perturbing the value of the output would completely destroy its utility, e.g.,~`What is the optimal price for this auction?', most works in the literature use the \emph{Exponential mechanism}~\cite{mcsherry2007mechanism}.
This mechanism utilizes a utility function $u$ that maps (input data set $D$, output value $r$) pairs to utility scores, and selects an output value $r$ from the input pairs with probability proportional to $\exp(\frac{\varepsilon u(D, r)}{2\Delta u})$.
$\Delta u$ is the sensitivity of the utility
% \kat{what is the utility function?}
% \mk{Already explained}
function.

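A minimal sketch of the selection step, on a hypothetical `most visited country' query where the utility of a candidate is its visit count (so adding/removing one individual changes the utility by at most $\Delta u = 1$). Subtracting the maximum score before exponentiating only rescales the weights by a common factor, so the selection probabilities are unchanged while large budgets no longer overflow.

```python
import math
import random

def exponential_mechanism(dataset, outputs, u, delta_u, epsilon):
    """Select an output r with probability proportional to
    exp(epsilon * u(dataset, r) / (2 * delta_u))."""
    scores = [epsilon * u(dataset, r) / (2 * delta_u) for r in outputs]
    m = max(scores)  # common rescaling for numerical stability
    weights = [math.exp(s - m) for s in scores]
    return random.choices(outputs, weights=weights, k=1)[0]

# Hypothetical visit counts; utility of a candidate = its count.
visits = {"France": 40, "Italy": 30, "Spain": 5}
winner = exponential_mechanism(visits, list(visits), lambda D, r: D[r], 1, 0.5)
```

Unlike the Laplace mechanism, the output itself is always a valid answer from the candidate set; only the \emph{choice} among candidates is randomized.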
Another technique for differential privacy mechanisms is \emph{randomized response}~\cite{warner1965randomized}.
It is a privacy-preserving survey method that introduces probabilistic noise into the collected statistics by randomly instructing respondents to answer truthfully or `Yes' to a sensitive, binary question.
The technique achieves this randomization by including a random event, e.g.,~the flip of a coin, in the answering process.
The respondents reveal to the interviewers only their answer to the question, and keep as a secret the result of the random event (i.e.,~if the coin was tails or heads).
Thereafter, the interviewers can calculate the probability distribution of the random event, e.g.,~$\frac{1}{2}$ heads and $\frac{1}{2}$ tails, and thus they can roughly eliminate the false responses and estimate the final result of the research.
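The debiasing step follows directly from the coin probabilities: with a fair coin, $\Pr[\text{Yes}] = \frac{1}{2}p + \frac{1}{2}$, where $p$ is the true rate of `Yes' answers, so $p = 2\Pr[\text{Yes}] - 1$. A minimal sketch of this variant of the protocol:

```python
import random

def respond(truth):
    """One respondent: on heads answer truthfully, on tails answer 'Yes'.
    The coin result is kept secret; only the answer is revealed."""
    heads = random.random() < 0.5
    return truth if heads else True

def estimate_true_rate(answers):
    """Debias the collected answers: Pr[Yes] = 0.5 * p + 0.5,
    hence p = 2 * Pr[Yes] - 1."""
    observed = sum(answers) / len(answers)
    return 2 * observed - 1
```

Any individual `Yes' is deniable (it may be a forced answer), yet the aggregate rate is still recoverable from a large enough sample.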

% \kat{is the following two paragraphs still part of the examples of privacy mechanisms? I am little confused here.. if the section is not only for examples, then you should introduce it somehow (and not start directly by saying 'A typical example...')}

A special category of differential privacy-preserving
% algorithms
% \kat{algorithms? why not mechanisms ?}
mechanisms
is that of \emph{pan-private} algorithms~\cite{dwork2010pan}.
Pan-private algorithms hold their privacy guarantees even when snapshots of their internal state (memory) are accessed during their execution by an external entity, e.g.,~through a subpoena, a security breach, etc.
There are two intrusion types that a data publisher has to deal with when designing a pan-private mechanism: \emph{single unannounced} and \emph{continual announced} intrusion.
In the first, the data publisher assumes that the mechanism's state is observed by the external entity a single time, without the data publisher ever being notified about it.
In the latter, the external entity gains access to the mechanism's state multiple times, and the publisher is notified after each time.
The simplest approach to deal with both cases is to make sure that the data in the memory of the mechanism constantly have the same distribution, i.e.,~that they are differentially private.
Notice that this must hold throughout the mechanism's lifetime, even before/after it processes any sensitive data item(s).
% \kat{what do you mean here? even if it processes non-sensitive items before or after?}
% \mk{Yes}

% \kat{The way you start this paragraph is more suited for the related work. If you want to present Pufferfish as a background knowledge, do it directly. But in my opinion, since you do not use it for your work, there is no meaning for putting this in your background section. Mentioning it in the related work is sufficient. Same for geo-indistinguishability. }

\bigskip

However, the \emph{post-processing} of a perturbed data set can be done without deteriorating its privacy guarantee.
\begin{theorem}
The post-processing of any output of an $\varepsilon$-differential privacy mechanism shall not deteriorate its privacy guarantee.
\end{theorem}

Naturally, using the same (or different) privacy mechanism(s) multiple times to interact with raw data in combination with already perturbed data implies that the privacy guarantee of the final output will be calculated according to Theorem~\ref{theor:compo-seq-ind}.
% \kat{can you be more explicit here? Do you mean that only the consumption of budget on the raw data will be taken into account? And that the queries over the results do not count?}
That is, we add up the privacy budgets attributed to the outputs from previous mechanism applications with the current privacy budget.

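In practice this bookkeeping amounts to a running sum over all interactions with the raw data; a minimal sketch of such a budget ledger (the class and its names are ours, for illustration):

```python
class BudgetLedger:
    """Track the total privacy budget consumed when the same raw data
    are used by several mechanism applications (sequential composition)."""

    def __init__(self):
        self.charges = []

    def charge(self, epsilon):
        """Record one epsilon-differentially private interaction
        and return the overall budget spent so far."""
        self.charges.append(epsilon)
        return self.total()

    def total(self):
        """Overall guarantee: the sum of all spent budgets."""
        return sum(self.charges)
```

Post-processing of already-released outputs, by contrast, would not call `charge` at all, since it does not touch the raw data.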
\begin{example}
\label{ex:application}
The Age and Location attributes are the quasi-identifiers, so we proceed to adequately generalize them.
We turn age values to ranges ($\leq 20$, and $> 20$), and generalize location to city level (Paris).
Finally, we achieve $3$-anonymity by putting the entries in groups of three, according to the quasi-identifiers.
Figure~\ref{fig:scenario-micro} depicts the results at each timestamp.
\includetable{preliminaries/scenario-statistical}