diff --git a/text/preliminaries/privacy.tex b/text/preliminaries/privacy.tex
index 692d971..f3ce0ba 100644
--- a/text/preliminaries/privacy.tex
+++ b/text/preliminaries/privacy.tex
@@ -131,29 +131,32 @@ For completeness, in this section we present the seminal works for privacy-prese
 Sweeney coined \emph{$k$-anonymity}~\cite{sweeney2002k}, one of the first established works on data privacy.
 A released data set features $k$-anonymity protection when the sequence of values for a set of identifying attributes, called the \emph{quasi-identifiers}, is the same for at least $k$ records in the data set.
 Computing the quasi-identifiers in a set of attributes is still a hard problem on its own~\cite{motwani2007efficient}.
-$k$-anonymity is syntactic, it constitutes an individual indistinguishable from at least $k-1$ other individuals in the same data set.
+$k$-anonymity is syntactic\kat{meaning?}: it renders an individual indistinguishable from at least $k-1$ other individuals in the same data set.\kat{you just said this in another way,two sentences before}
 In a follow-up work~\cite{sweeney2002achieving}, the author describes a way to achieve $k$-anonymity for a data set by the suppression or generalization of certain values of the quasi-identifiers.
-Machanavajjhala et al.~\cite{machanavajjhala2006diversity} pointed out that $k$-anonymity is vulnerable to homogeneity and background knowledge attacks.
+
+Several works identified and addressed privacy concerns with $k$-anonymity. Machanavajjhala et al.~\cite{machanavajjhala2006diversity} pointed out that $k$-anonymity is vulnerable to homogeneity and background knowledge attacks.
 Thereby, they proposed \emph{$l$-diversity}, which demands that the values of the sensitive attributes are `well-represented' by $l$ sensitive values in each group.
 Principally, a data set can be $l$-diverse by featuring at least $l$ distinct values for the sensitive field in each group (\emph{distinct} $l$-diversity).
 Other instantiations demand that the entropy of the whole data set is greater than or equal to $\log(l)$ (\emph{entropy} $l$-diversity) or that the number of appearances of the most common sensitive value is less than the sum of the counts of the rest of the values multiplied by a user defined constant $c$ (\emph{recursive (c, l)}-diversity).
 Later on, Li et al.~\cite{li2007t} indicated that $l$-diversity can be void by skewness and similarity attacks due to sensitive attributes with a small value range.
 In such cases, \emph{$\theta$-closeness} guarantees that the distribution of a sensitive attribute in a group and the distribution of the same attribute in the whole data set is `similar'.
-This similarity is bounded by a threshold $\theta$.
-A data set features $\theta$-closeness when all of its groups feature $\theta$-closeness.
+This similarity is bound by a threshold $\theta$.
+A data set features $\theta$-closeness when all of its groups satisfy $\theta$-closeness.
 The main drawback of $k$-anonymity (and its derivatives) is that it is not tolerant to external attacks of re-identification on the released data set.
 The problems identified in~\cite{sweeney2002k} appear when attempting to apply $k$-anonymity on continuous data publishing (as we will also see next in Section~\ref{sec:micro}).
 These attacks include multiple $k$-anonymous data set releases with the same record order, subsequent releases of a data set without taking into account previous $k$-anonymous releases, and tuple updates.
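+As a minimal, hypothetical illustration of the above notions (the following table and its values are ours and purely indicative), consider a release with quasi-identifiers age and ZIP code and sensitive attribute diagnosis:
+\begin{center}
+\begin{tabular}{c c c}
+age & ZIP code & diagnosis \\
+$[20, 30)$ & 151** & flu \\
+$[20, 30)$ & 151** & cancer \\
+$[40, 50)$ & 152** & flu \\
+$[40, 50)$ & 152** & flu
+\end{tabular}
+\end{center}
+Every quasi-identifier combination appears in at least two records, so the release is $2$-anonymous; the first group is also distinct $2$-diverse since it contains two distinct diagnoses, whereas the second group is not, and an adversary who knows that an individual belongs to it learns their diagnosis with certainty (a homogeneity attack).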
-Proposed solutions include rearranging the attributes, setting the whole attribute set of previously released data sets as quasi-identifiers or releasing data based on previous $k$-anonymous releases.
+Proposed solutions include rearranging the attributes, setting the whole attribute set of previously released data sets as quasi-identifiers, or releasing data based on previous $k$-anonymous releases.\kat{and the citations of these solutions?}
 \subsubsection{Statistical data}
 \label{subsec:prv-statistical}
-While methods based on $k$-anonymity have been mainly employed for releasing microdata, \emph{differential privacy}~\cite{dwork2006calibrating} has been proposed for releasing high utility aggregates over microdata while providing semantic privacy guarantees.
-Differential privacy is algorithmic, it ensures that any adversary observing a privacy-protected output, no matter his/her computational power or auxiliary information, cannot conclude with absolute certainty if an individual is included in the input data set.
-Moreover, it quantifies and bounds the impact that the addition/removal of the data of an individual to/from an input data set has on the derived privacy-protected aggregates.
+While methods based on $k$-anonymity have been mainly employed for releasing microdata, \emph{differential privacy}~\cite{dwork2006calibrating} has been proposed for releasing high-utility aggregates over microdata while providing semantic\kat{semantic ?} privacy guarantees.
+Differential privacy is algorithmic \kat{algorithmic? moreover, you repeat this sentence later on, after the definition of neighboring datasets}: it ensures that any adversary observing a privacy-protected output, no matter his/her computational power or auxiliary information, cannot conclude with absolute certainty whether an individual is included in the input data set.
+Moreover, it quantifies and bounds the impact that the addition/removal of an individual to/from a data set has on the derived privacy-protected aggregates.
+
+\kat{introduce the following definition, and link it to the text before. Maybe you can put the definition after the following paragraph.}
 \begin{definition} [Neighboring data sets]
@@ -161,10 +164,10 @@ Moreover, it quantifies and bounds the impact that the addition/removal of the d
 Two data sets are neighboring (or adjacent) when they differ by at most one tuple, i.e.,~one can be obtained by adding/removing the data of an individual to/from the other.
 \end{definition}
-More precisely, differential privacy quantifies the impact of the addition/removal of a single tuple in $D$ on the output $\pmb{o}$ of $\mathcal{M}$.
+More precisely, differential privacy quantifies the impact of the addition/removal of a single tuple in $D$ on the output $\pmb{o}$ of a privacy mechanism $\mathcal{M}$. \kat{what is M?}
 The distribution of all $\pmb{o}$, in some range $\mathcal{O}$, is not affected \emph{substantially}, i.e.,~it changes only slightly due to the modification of any one tuple in all possible $D \in \mathcal{D}$.
-Thus, differential privacy is algorithmic, it ensures that any adversary observing any $\pmb{o}$ cannot conclude with absolute certainty whether or not any individual is included in any $D$.
-Its performance is irrelevant to the computational power and auxiliary information available to an adversary observing the outputs of $\mathcal{M}$.
+Thus, differential privacy is algorithmic\kat{??}: it ensures that any adversary observing any $\pmb{o}$ cannot conclude with absolute certainty whether or not any individual is included in any $D$.
+Its guarantee holds regardless of the computational power and auxiliary information available to an adversary observing the outputs of $\mathcal{M}$.\kat{you already said this. Moreover, it is irrelevant to the neighboring datasets and thus does not fit here..}
 \begin{definition} [Differential privacy]
@@ -173,14 +176,17 @@ Its performance is irrelevant to the computational power and auxiliary informati
 $$\Pr[\mathcal{M}(D) \in O] \leq e^\varepsilon \Pr[\mathcal{M}(D') \in O]$$
 \end{definition}
-\noindent $\Pr[\cdot]$ denotes the probability of $\mathcal{M}$ generating $\pmb{o}$ as output, from a set of $O \subseteq \mathcal{O}$, when given any version of $D$ as input.
-The privacy budget $\varepsilon$ is a positive real number that represents the user-defined privacy goal~\cite{mcsherry2009privacy}.
-As the definition implies, $\mathcal{M}$ achieves stronger privacy protection for lower values of $\varepsilon$ since the probabilities of $D$ and $D'$ being true worlds are similar, but the utility of $\pmb{o}$ is reduced since more randomness is introduced by $\mathcal{M}$.
+\noindent $\Pr[\cdot]$ denotes the probability of $\mathcal{M}$ generating an output $\pmb{o}$ \kat{there is no o in the definition above} that belongs to $O \subseteq \mathcal{O}$, when given $D$ as input.
+The \emph{privacy budget} $\varepsilon$ is a positive real number that represents the user-defined privacy goal~\cite{mcsherry2009privacy}.
+As the definition implies, $\mathcal{M}$ achieves stronger privacy protection for lower values of $\varepsilon$ since the probabilities of $D$ and $D'$ being true worlds are similar, but the utility of $\pmb{o}$ \kat{there is no o in the definition above} is reduced since more randomness is introduced by $\mathcal{M}$.
 The privacy budget $\varepsilon$ is usually set to $0.01$, $0.1$, or, in some cases, $\ln2$ or $\ln3$~\cite{lee2011much}.
-Its local variant~\cite{duchi2013local} is compatible with microdata, where $D$ is composed of a single data item and is represented by $x$.
+The local variant of differential privacy~\cite{duchi2013local} is compatible with microdata, where $D$ is composed of a single data item and is represented by $x$.\kat{Seems out of place and needs to be described a little more..}
+
 We refer the interested reader to~\cite{desfontaines2020sok} for a systematic taxonomy of the different variants and extensions of differential privacy.
+\kat{introduce and link to the previous text the following definition }
+
 \begin{definition} [Query function sensitivity]
 \label{def:qry-sens}
@@ -188,16 +194,16 @@ We refer the interested reader to~\cite{desfontaines2020sok} for a systematic ta
 $$\Delta f = \max_{D, D' \in \mathcal{D}} \lVert {f(D) - f(D')} \rVert_{1}$$
 \end{definition}
-The pertinence of differential privacy methods is inseparable from the query's function sensitivity.
-The presence/absence of a single record can only change the result slightly, and therefore differential privacy methods are best for low sensitivity queries such as counts.
-However, sum and max queries can be problematic since a single (very different) value could change the output noticeably, making it necessary to add a lot of noise to the query's answer.
-Furthermore, asking a series of queries may allow the disambiguation between possible data sets, making it necessary to add even more noise to the outputs.
-For this reason, after a series of queries exhausts the available privacy budget the data set has to be discarded.
-Keeping the original guarantee across multiple queries that require different/new answers requires the injection of noise proportional to the number of the executed queries, and thus destroying the utility of the output.
+The pertinence \kat{pertinence to what?} of differential privacy methods is inseparable from the sensitivity of the query \kat{here, you need to associate a mechanism M to the query, because so far you have been talking for mechanisms} function.
+The presence/absence of a single record can only change the result slightly\kat{do you want to say 'should' and not 'can'?}, and therefore differential privacy methods are best for low-sensitivity queries such as counts.
+However, sum and max \kat{and average } queries can be problematic since a single outlying value could change the output noticeably, making it necessary to add a lot of noise to the query's answer.
+\kat{How does the following connects to the query's sensitivity?}Furthermore, asking a series of queries may allow the disambiguation between possible data sets, making it necessary to add even more noise to the outputs.
+For this reason, after a series of queries exhausts the available privacy budget \kat{you have not talked about the sequential theorem, so this comes out of the blue}, the data set has to be discarded.
+\kat{THe following is an explanation of the previous. When you restate sth in different words for explanation, please say that you do so, otherwise it is not clear what new you want to convey.}Keeping the original guarantee across multiple queries that require different/new answers requires the injection of noise proportional to the number of the executed queries, thus destroying the utility of the output.
 \paragraph{Privacy mechanisms}
 \label{subsec:prv-mech}
-A typical example of differential privacy mechanism is the \emph{Laplace mechanism}~\cite{dwork2014algorithmic}.
+A typical example of a differential privacy mechanism is the \emph{Laplace mechanism}~\cite{dwork2014algorithmic}.
 It draws randomly a value from the probability distribution of $\textrm{Laplace}(\mu, b)$, where $\mu$ stands for the location parameter and $b > 0$ the scale parameter (Figure~\ref{fig:laplace}).
 Here, $\mu$ is equal to the original output value of a query function, and $b$ is the sensitivity of the query function divided by $\varepsilon$.
 The Laplace mechanism works for any function with range the set of real numbers.
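+To make the scale parameter concrete, the following is a brief worked example (the query types and numbers are ours and purely indicative).
+For a count query, the presence or absence of a single record changes the result by at most $1$, so $\Delta f = 1$; for a sum query over an attribute whose values lie in $[0, 1000]$, a single record can shift the total by up to $1000$, so $\Delta f = 1000$.
+With a privacy budget of $\varepsilon = 0.1$, the Laplace mechanism would thus return a value drawn from $\textrm{Laplace}(f(D), 1/0.1) = \textrm{Laplace}(f(D), 10)$ for the count, but from $\textrm{Laplace}(f(D), 10000)$ for the sum, which illustrates why low-sensitivity queries retain considerably more utility under the same privacy guarantee.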