privacy: Addressed \kat{} in statistical

Manos Katsomallos 2021-09-03 13:46:33 +03:00
parent 084b2faf2d
commit a78c03127d


\subsubsection{Statistical data}
\label{subsec:prv-statistical}
While methods based on $k$-anonymity have been mainly employed for releasing microdata, \emph{differential privacy}~\cite{dwork2006calibrating} has been proposed for releasing high-utility aggregates over microdata while providing semantic privacy guarantees that characterize the output data.
Differential privacy is algorithmic: it characterizes the data publishing process, which passes its privacy guarantee on to the resulting data.
It ensures that any adversary observing a privacy-protected output, no matter their computational power or auxiliary information, cannot conclude with absolute certainty whether an individual is included in the input data set.
Moreover, it quantifies and bounds the impact that the addition/removal of an individual to/from a data set has on the derived privacy-protected aggregates thereof.
More precisely, differential privacy quantifies the impact of the addition/removal of a single tuple in $D$ on the output $\pmb{o}$ of a privacy mechanism $\mathcal{M}$.
The distribution of all $\pmb{o}$, in some range $\mathcal{O}$, is not affected \emph{substantially}, i.e.,~it changes only slightly due to the modification of any one tuple in all possible $D \in \mathcal{D}$.
This notion of data sets differing in a single tuple is formalized by the following definition.
\begin{definition}
[Neighboring data sets]
Two data sets are neighboring (or adjacent) when they differ by at most one tuple, i.e.,~one can be obtained by adding/removing the data of an individual to/from the other.
\end{definition}
\begin{definition}
[Differential privacy]
A privacy mechanism $\mathcal{M}$ satisfies $\varepsilon$-differential privacy if, for every pair of neighboring data sets $D, D' \in \mathcal{D}$ and every set of possible outputs $O \subseteq \mathcal{O}$, it holds that
$$\Pr[\mathcal{M}(D) \in O] \leq e^\varepsilon \Pr[\mathcal{M}(D') \in O]$$
\end{definition}
\noindent $\Pr[\cdot]$ denotes the probability of $\mathcal{M}$ generating an output in $O \subseteq \mathcal{O}$, when given $D$ as input.
The \emph{privacy budget} $\varepsilon$ is a positive real number that represents the user-defined privacy goal~\cite{mcsherry2009privacy}.
As the definition implies, $\mathcal{M}$ achieves stronger privacy protection for lower values of $\varepsilon$, since the probabilities of $D$ and $D'$ being the true world are similar; however, the utility of the output is reduced, since more randomness is introduced by $\mathcal{M}$.
The privacy budget $\varepsilon$ is usually set to $0.01$, $0.1$, or, in some cases, $\ln2$ or $\ln3$~\cite{lee2011much}.
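To make the role of $\varepsilon$ concrete, the following minimal Python sketch (with hypothetical values, not an implementation from this work) perturbs a count query using the standard Laplace mechanism, which adds noise of scale $\Delta f / \varepsilon$; lower budgets visibly inflate the error:

```python
import random
import statistics

def laplace_noise(scale):
    # The difference of two Exp(1) draws is Laplace(0, 1); rescale to the target scale.
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

def laplace_mechanism(true_answer, sensitivity, epsilon):
    # Standard Laplace mechanism: noise scale = sensitivity / epsilon.
    return true_answer + laplace_noise(sensitivity / epsilon)

random.seed(0)
true_count = 1000   # hypothetical answer of a count query
sensitivity = 1     # one individual changes a count by at most 1

for epsilon in (1.0, 0.1, 0.01):
    errors = [abs(laplace_mechanism(true_count, sensitivity, epsilon) - true_count)
              for _ in range(20_000)]
    # The mean absolute error concentrates near sensitivity/epsilon:
    # stronger privacy (smaller epsilon) means less utility.
    print(f"epsilon={epsilon}: mean |error| ~= {statistics.mean(errors):.1f}")
```

The tenfold error growth from $\varepsilon = 1$ to $\varepsilon = 0.1$ illustrates the privacy/utility trade-off stated above.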
% Its local variant~\cite{duchi2013local} is compatible with microdata, where $D$ is composed of a single data item and is represented by $x$.\kat{Seems out of place and needs to be described a little more..}
% We refer the interested reader to~\cite{desfontaines2020sok} for a systematic taxonomy of the different variants and extensions of differential privacy.
We refer the interested reader to~\cite{desfontaines2020sok} for a systematic taxonomy of the different variants and extensions of differential privacy.
The applicability of differential privacy mechanisms is inseparable from the sensitivity of the query function that a mechanism $\mathcal{M}$ answers.
The presence/absence of a single record should only change the result slightly, and therefore differential privacy methods are best suited for low-sensitivity queries such as counts.
However, sum, max, and, in some cases, average queries can be problematic, since a single (but outlier) value could change the output noticeably, making it necessary to add a lot of noise to the query's answer.
We formally define the sensitivity of a query function below.
\begin{definition}
[Query function sensitivity]
The \emph{sensitivity} $\Delta f$ of a query function $f$ is the maximum impact that the addition/removal of a single tuple can have on the query's result, over all pairs of neighboring data sets $D, D' \in \mathcal{D}$:
$$\Delta f = \max_{D, D' \in \mathcal{D}} \lVert {f(D) - f(D')} \rVert_{1}$$
\end{definition}
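As a rough illustration of why counts are benign and sums are not, the following Python sketch (hypothetical data; it brute-forces only the remove-one neighbors of a single data set, i.e.,~a local rather than the global sensitivity over all of $\mathcal{D}$) compares the two queries:

```python
def empirical_sensitivity(f, dataset):
    # Maximum change in f caused by removing any single tuple from `dataset`.
    # Note: this is a local estimate, not the global sensitivity over all data sets.
    base = f(dataset)
    return max(abs(base - f(dataset[:i] + dataset[i + 1:]))
               for i in range(len(dataset)))

salaries = [28_000, 31_000, 29_500, 30_200, 250_000]  # one outlier value

count_sens = empirical_sensitivity(lambda d: len(d), salaries)  # -> 1
sum_sens = empirical_sensitivity(lambda d: sum(d), salaries)    # -> 250_000
print(count_sens, sum_sens)
```

A count changes by exactly 1 per individual, whereas the sum's sensitivity is driven by the single outlier, which is why a sum query would require far more noise.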
\paragraph{Privacy mechanisms}
\label{subsec:prv-mech}
Generally, when we apply a series of independent (i.e.,~in the way that they inject randomness) privacy mechanisms over the same data set, their privacy guarantees accumulate according to the \emph{sequential} composition theorem~\cite{mcsherry2009privacy}.
The privacy guarantee of $m \in \mathbb{Z}^+$ independent privacy mechanisms, satisfying $\varepsilon_1$-, $\varepsilon_2$-, \dots, $\varepsilon_m$-differential privacy respectively, when applied over the same data set, equals $\sum_{i = 1}^m \varepsilon_i$.
\end{theorem}
Asking a series of queries may allow the disambiguation between possible data sets, making it necessary to add even more noise to the outputs.
In other words, keeping the original guarantee across multiple queries that require different/new answers requires injecting noise proportional to the number of executed queries, which destroys the utility of the output.
For this reason, once a series of queries exhausts the available privacy budget, as dictated by the sequential composition theorem, the data set has to be discarded.
Notice that the sequential composition corresponds to the worst case scenario where each time we use a mechanism we have to invest some (or all) of the available privacy budget.
In the special case that we query disjoint data sets, we can take advantage of the \emph{parallel} composition property~\cite{mcsherry2009privacy, soria2016big}, and thus spare some of the available privacy budget.
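A minimal budget-accounting sketch in Python (function names are illustrative, not from a specific library) contrasts the two composition properties:

```python
def sequential_budget(epsilons):
    # Sequential composition: queries over the same data set consume budget additively.
    return sum(epsilons)

def parallel_budget(epsilons):
    # Parallel composition: queries over disjoint partitions cost only the maximum epsilon.
    return max(epsilons)

total_budget = 1.0
spent = sequential_budget([0.1, 0.1, 0.3])  # three queries over the same data set
print(spent)                                # 0.5 of the budget consumed
print(parallel_budget([0.1, 0.1, 0.3]))     # only 0.3 if the partitions are disjoint
assert spent <= total_budget
```

Once `spent` reaches `total_budget`, no further queries can be answered and the data set has to be discarded, as discussed above.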