% the-last-thing/text/preliminaries/privacy.tex
\section{Data privacy}
\label{sec:privacy}
In this section, we first study the notion of information disclosure and the privacy attacks that can lead to it.
We then investigate the possible privacy protection levels in continuous data publishing.
Finally, we identify the most common privacy-preserving operations and review the seminal works on privacy-preserving data publishing.
\subsection{Information disclosure}
\label{subsec:prv-info-dscl}
When personal data are publicly released, either as microdata or as statistical data, individuals' privacy can be compromised, i.e.,~an adversary becomes certain about an individual's \emph{sensitive attribute} (personal information) with a probability higher than a desired threshold.
In the literature, this incident is known as \emph{information disclosure} and is usually categorized~\cite{li2007t, wang2010privacy, narayanan2008robust} as:
\begin{itemize}
\item \emph{Presence disclosure} takes place when the participation or absence of an individual in a data set is revealed.
\item \emph{Identity disclosure} links an individual to a particular record.
\item \emph{Attribute disclosure} reveals information (attribute value) about an individual.
\end{itemize}
In the literature, identity disclosure is also referred to as \emph{record linkage}, and presence disclosure as \emph{table linkage}.
Notice that identity disclosure can result in attribute disclosure, and vice versa.
To better illustrate these definitions, we provide some examples based on Figure~\ref{fig:snapshot}.
Presence disclosure occurs when, by looking at the (privacy-protected) counts of Figure~\ref{tab:snapshot-statistical}, we can guess whether Quackmore has participated in the microdata of Figure~\ref{tab:snapshot-micro}.
Identity disclosure occurs when we can guess that the sixth record of (a privacy-protected version of) the microdata of Figure~\ref{tab:snapshot-micro} belongs to Quackmore.
Attribute disclosure occurs when (a privacy-protected version of) the microdata of Figure~\ref{tab:snapshot-micro} reveals that Quackmore is $62$ years old.
\subsection{Attacks to privacy}
\label{subsec:prv-attacks}
Information disclosure is typically achieved by combining supplementary (background) knowledge with the released data, or by exploiting unrealistic assumptions made while designing the privacy-preserving algorithms.
In its general form, this is known as an \emph{adversarial} or \emph{linkage} attack.
Even though many works refer directly to the general category of linkage attacks, we also distinguish the following sub-categories:
\begin{itemize}
\item \emph{Sensitive attribute domain knowledge} can result in \emph{homogeneity} and \emph{skewness} attacks~\cite{machanavajjhala2006diversity,li2007t}, when statistics of the sensitive attribute values are available, and in \emph{similarity} attacks, when semantics of the sensitive attribute values are available.
\item \emph{Complementary release attacks}~\cite{sweeney2002k} take place when attackers take into account previous releases of different versions of the same and/or related data sets.
In this category, we also identify the \emph{unsorted matching} attack~\cite{sweeney2002k}, which is achieved when two privacy-protected versions of an original data set are published in the same tuple ordering.
Other instances include: (i)~the \emph{join} attack~\cite{wang2006anonymizing}, when tuples can be identified by joining several releases on the non uniquely identifying attributes, i.e.,~the \emph{quasi-identifiers}; (ii)~the \emph{tuple correspondence} attack~\cite{fung2008anonymity}, when, in the case of incremental data, certain tuples correspond injectively to certain tuples in other releases; (iii)~the \emph{tuple equivalence} attack~\cite{he2011preventing}, when tuples among different releases are found to be equivalent with respect to the sensitive attribute; and (iv)~the \emph{unknown releases} attack~\cite{shmueli2015privacy}, when privacy preservation is performed without taking into account previous data releases.
\item \emph{Data correlation} attacks exploit dependencies that may exist either within one data set, or between one data set and previous data releases and/or other external sources~\cite{kifer2011no, chen2014correlated, liu2016dependence, zhao2017dependent}.
We will look into this category in more detail later in Section~\ref{sec:correlation}.
\end{itemize}
The first sub-category of attacks has been mainly addressed in works on snapshot microdata publishing, but is also present in continuous publishing; however, algorithms for continuous publishing typically adopt the solutions proposed for the snapshot publishing scheme (see the discussion of $k$-anonymity and $l$-diversity in Section~\ref{subsec:prv-seminal}).
This kind of attack is tightly coupled with publishing the (privacy-protected) sensitive attribute values.
An example is the lack of diversity in the sensitive attribute domain, e.g.,~if all users in the data set of Figure~\ref{tab:snapshot-micro} had \emph{running} as their Status (the sensitive attribute).
The second and third sub-categories are attacks that emerge (mostly) in continuous publishing scenarios.
Consider again the data set in Figure~\ref{tab:snapshot-micro}.
In a complementary release attack, an adversary can learn more about the individuals (e.g.,~that there is a high chance that Donald was at work) by combining the information of two privacy-protected versions of this data set.
In a data correlation attack, Donald's status could be inferred with more certainty by taking into account Dewey's status at the same moment and the dependencies between their statuses, e.g.,~when Dewey is at home, Donald is most probably at work.
In order to better protect Donald's privacy against such attacks, the data should be privacy-protected in a more rigorous way than when no such attacks are anticipated.
\subsection{Levels of privacy protection}
\label{subsec:prv-levels}
In continuous data publishing, we consider the privacy protection level with respect not only to the users, but also to the \emph{events} occurring in the data.
An event is a pair of an identifying attribute of an individual and the sensitive data (including contextual information); we can think of it as corresponding to a record in a database, where each individual may participate once.
Data publishers typically release events in the form of sequences of data items, usually indexed in time order (time series) and geotagged, e.g.,~(`Dewey', `at home at Montmartre at $t_1$'), \dots, (`Quackmore', `dining at Opéra at $t_1$').
We use the term `users' to refer to the \emph{individuals}, also known as \emph{participants}, who are the source of the processed and published data.
Therefore, they should not be confused with the consumers of the released data sets.
Users are subject to privacy attacks, and thus are the main point of interest of privacy protection mechanisms.
The possible privacy protection levels are:
\begin{enumerate}[(a)]
\item \emph{Event-level}~\cite{dwork2010differential, dwork2010pan} limits the privacy protection to \emph{any single event} in a time series, providing high data utility.
\item \emph{User-level}~\cite{dwork2010differential, dwork2010pan} protects \emph{all the events} in a time series, providing high user privacy.
\item \emph{$w$-event-level}~\cite{kellaris2014differentially} provides privacy protection to \emph{any sequence of $w$ events} in a time series.
\end{enumerate}
Figure~\ref{fig:prv-levels} demonstrates the application of the possible protection levels on the statistical data of Example~\ref{ex:continuous}.
For instance, in event-level (Figure~\ref{fig:level-event}) it is hard to determine whether Quackmore was dining at Opéra at $t_1$.
Moreover, in user-level (Figure~\ref{fig:level-user}) it is hard to determine whether Quackmore was ever included in the released series of events at all.
Finally, in $2$-event-level (Figure~\ref{fig:level-w-event}) it is hard to determine whether Quackmore was ever included in the released series of events between the timestamps $t_1$ and $t_2$, $t_2$ and $t_3$, etc. (i.e.,~for a window $w = 2$).
\begin{figure}[htp]
\centering
\hspace{\fill}\subcaptionbox{Event-level\label{fig:level-event}}{%
\includegraphics[width=.32\linewidth]{preliminaries/level-event}%
}\hspace{\fill}
\subcaptionbox{User-level\label{fig:level-user}}{%
\includegraphics[width=.32\linewidth]{preliminaries/level-user}%
}\hspace{\fill}
\subcaptionbox{$2$-event-level\label{fig:level-w-event}}{%
\includegraphics[width=.32\linewidth]{preliminaries/level-w-event}%
}\hspace{\fill}
\caption{Protecting the data of Figure~\ref{tab:continuous-statistical} on (a)~event-, (b)~user-, and (c)~$2$-event-level. A suitable distortion method can be applied accordingly.}
\label{fig:prv-levels}
\end{figure}
Contrary to event-level, which provides privacy guarantees for a single event, user- and $w$-event-level offer stronger privacy protection by protecting a series of events.
Event- and $w$-event-level better fit scenarios of infinite data observation, whereas user-level is more appropriate when the span of data observation is finite.
$w$-event-level protection is narrower than user-level protection due to its sliding-window processing methodology.
In the extreme cases where $w$ is equal either to $1$ or to the length of the time series, $w$-event-level matches event- or user-level protection, respectively.
Although the described levels have been coined in the context of \emph{differential privacy}~\cite{dwork2006calibrating}, a seminal privacy method that we will discuss in more detail in Section~\ref{subsec:prv-statistical}, they are used for other privacy protection techniques as well.
\subsection{Privacy-preserving operations}
\label{subsec:prv-operations}
We identify the following privacy operations that can be applied on the original data to achieve privacy preservation:
\begin{itemize}
\item \emph{Aggregation} combines multiple rows of a data set to form a single value that will replace these rows.
\item \emph{Generalization} replaces an attribute value with a parent value in the attribute taxonomy (when applicable).
\item \emph{Suppression} completely deletes certain sensitive values or entire records.
\item \emph{Perturbation} disturbs the initial attribute value in a deterministic or probabilistic way.
The probabilistic data distortion is referred to as \emph{randomization}.
\end{itemize}
For example, consider the table schema \emph{User(Name, Age, Location, Status)}.
If we want to protect the \emph{Age} of the users by aggregation, we may group the data by Location and report the average Age for each group; by generalization, we may replace the Age by Age intervals; by suppression, we may delete the entire table column corresponding to Age; by perturbation, we may augment each Age by a predefined percentage of it; by randomization, we may randomly replace each Age by a value drawn from the probability density function of the attribute.
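To make these operations concrete, the following minimal Python sketch (illustrative only; the names, ages, and locations are hypothetical and not part of the running example) applies each operation to the \emph{Age} attribute of such a table:
\begin{verbatim}
import random
from statistics import mean

users = [
    {"Name": "Donald", "Age": 54, "Location": "Montmartre", "Status": "at work"},
    {"Name": "Dewey", "Age": 16, "Location": "Montmartre", "Status": "at home"},
    {"Name": "Quackmore", "Age": 62, "Location": "Opera", "Status": "dining"},
]

# Aggregation: group by Location and report the average Age per group.
groups = {}
for record in users:
    groups.setdefault(record["Location"], []).append(record["Age"])
aggregated = {location: mean(ages) for location, ages in groups.items()}

# Generalization: replace the exact Age by a parent value (an interval).
generalized = ["<=20" if r["Age"] <= 20 else ">20" for r in users]

# Suppression: delete the Age attribute (column) entirely.
suppressed = [{k: v for k, v in r.items() if k != "Age"} for r in users]

# Perturbation (deterministic): augment each Age by a predefined percentage.
perturbed = [r["Age"] * 1.10 for r in users]

# Randomization (probabilistic perturbation): replace each Age by a random
# draw from the empirical distribution of the attribute.
randomized = [random.choice([r["Age"] for r in users]) for _ in users]
\end{verbatim}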
It is worth mentioning that there is a series of algorithms (e.g.,~\cite{benaloh2009patient, kamara2010cryptographic, cao2014privacy}) based on \emph{cryptographic} operations.
However, the majority of these methods, among the other assumptions that they make, place minimal or even no trust in the entities that handle the personal information.
Furthermore, the amount and the kind of data processing that these techniques require usually burden the overall procedure and deteriorate the utility of the resulting data sets, in some cases to the point where they become completely useless, thus restricting their usage by third parties.
Our focus is limited to techniques that achieve a satisfying balance between participants' privacy and data utility.
\subsection{Basic notions for privacy protection}
\label{subsec:prv-seminal}
For completeness, in this section we present the seminal works for privacy-preserving data publishing which, even though originally designed for the snapshot publishing scenario, have paved the way for privacy-preserving continuous publishing as well.
\subsubsection{Microdata}
\label{subsec:prv-micro}
Sweeney coined \emph{$k$-anonymity}~\cite{sweeney2002k}, one of the first established works on data privacy.
A released data set features $k$-anonymity protection when the values of a set of identifying attributes, called the \emph{quasi-identifiers}, are the same for at least $k$ records in the data set.
In a follow-up work~\cite{sweeney2002achieving}, the author describes a way to achieve $k$-anonymity for a data set by the suppression or generalization of certain values of the quasi-identifiers.
Several works identified and addressed privacy concerns on $k$-anonymity. Machanavajjhala et al.~\cite{machanavajjhala2006diversity} pointed out that $k$-anonymity is vulnerable to homogeneity and background knowledge attacks.
To address this, they proposed \emph{$l$-diversity}, which demands that the values of the sensitive attribute are `well-represented' by $l$ sensitive values in each group.
Principally, a data set can be $l$-diverse by featuring at least $l$ distinct values for the sensitive field in each group (\emph{distinct} $l$-diversity).
Other instantiations demand that the entropy of the sensitive attribute in each group is greater than or equal to $\log(l)$ (\emph{entropy} $l$-diversity), or that the number of appearances of the most frequent sensitive value in a group is less than the sum of the counts of the $l$-th through the least frequent sensitive values in that group, multiplied by a user-defined constant $c$ (\emph{recursive $(c, l)$-diversity}).
Later on, Li et al.~\cite{li2007t} indicated that $l$-diversity can be rendered void by skewness and similarity attacks when the sensitive attribute has a small value range.
In such cases, \emph{$\theta$-closeness} guarantees that the distribution of a sensitive attribute in a group and the distribution of the same attribute in the whole data set are `similar'.
This similarity is bound by a threshold $\theta$.
A data set features $\theta$-closeness when all of its groups satisfy $\theta$-closeness.
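As an illustration of these definitions, the following sketch (a toy check, not one of the published anonymization algorithms) verifies $k$-anonymity and distinct $l$-diversity for a table whose quasi-identifiers are Age and Location and whose sensitive attribute is Status:
\begin{verbatim}
from collections import defaultdict

def equivalence_classes(records, quasi_identifiers):
    # Group records that share the same quasi-identifier values.
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[a] for a in quasi_identifiers)].append(r)
    return list(groups.values())

def is_k_anonymous(records, quasi_identifiers, k):
    # Every equivalence class must contain at least k records.
    return all(len(g) >= k
               for g in equivalence_classes(records, quasi_identifiers))

def is_distinct_l_diverse(records, quasi_identifiers, sensitive, l):
    # Every equivalence class must contain at least l distinct sensitive values.
    return all(len({r[sensitive] for r in g}) >= l
               for g in equivalence_classes(records, quasi_identifiers))

table = [
    {"Age": "<=20", "Location": "Paris", "Status": "at home"},
    {"Age": "<=20", "Location": "Paris", "Status": "walking"},
    {"Age": "<=20", "Location": "Paris", "Status": "at home"},
    {"Age": ">20", "Location": "Paris", "Status": "dining"},
    {"Age": ">20", "Location": "Paris", "Status": "at work"},
    {"Age": ">20", "Location": "Paris", "Status": "driving"},
]
print(is_k_anonymous(table, ["Age", "Location"], 3))                   # True
print(is_distinct_l_diverse(table, ["Age", "Location"], "Status", 2))  # True
\end{verbatim}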
The main drawback of $k$-anonymity (and its derivatives) is that it is vulnerable to external re-identification attacks on the released data set.
The problems identified in~\cite{sweeney2002k} appear when attempting to apply $k$-anonymity on continuous data publishing (as we will also see next in Section~\ref{sec:micro}).
These attacks include multiple $k$-anonymous data set releases with the same record order, subsequent releases of a data set without taking into account previous $k$-anonymous releases, and tuple updates.
Proposed solutions include rearranging the attributes, setting the whole attribute set of previously released data sets as quasi-identifiers or releasing data based on previous $k$-anonymous releases~\cite{simi2017extensive}.
\subsubsection{Statistical data}
\label{subsec:prv-statistical}
While methods based on $k$-anonymity have been mainly employed for releasing microdata, \emph{differential privacy}~\cite{dwork2006calibrating} has been proposed for releasing high-utility aggregates over microdata while providing semantic privacy guarantees that characterize the output data.
Differential privacy is an algorithmic property: it characterizes the data publishing process, which passes its privacy guarantee on to the resulting data.
It ensures that any adversary observing a privacy-protected output, no matter their computational power or auxiliary information, cannot conclude with absolute certainty whether an individual is included in the input data set (Definition~\ref{def:nb-d-s}).
\begin{definition}
[Neighboring data sets~\cite{dwork2006calibrating}]
\label{def:nb-d-s}
Two data sets are neighboring (or adjacent) when they differ by at most one tuple, i.e.,~one can be obtained by adding/removing the data of an individual to/from the other.
\end{definition}
Moreover, differential privacy quantifies and bounds the impact that the addition/removal of an individual to/from a data set has on the derived privacy-protected aggregates thereof.
More precisely, a privacy mechanism $\mathcal{M}$ perturbs the result of a query function $f$ over a data set $D$, and differential privacy quantifies the impact that the addition/removal of a single tuple in $D$ has on the output $\pmb{o}$ of $\mathcal{M}$.
The distribution of all outputs $\pmb{o}$, in some range $\mathcal{O}$, is not affected \emph{substantially}, i.e.,~it changes only slightly due to the modification of any one tuple, for all possible $D \in \mathcal{D}$.
Formally, differential privacy is given in Definition~\ref{def:dp}.
\begin{definition}
[Differential privacy~\cite{dwork2006calibrating}]
\label{def:dp}
A privacy mechanism $\mathcal{M}$, with domain $\mathcal{D}$ and range $\mathcal{O}$, satisfies $\varepsilon$-differential privacy, for a given privacy budget $\varepsilon$, if for every pair of neighboring data sets $D, D' \in \mathcal{D}$ and all sets $O \subseteq \mathcal{O}$:
$$\Pr[\mathcal{M}(D) \in O] \leq e^\varepsilon \Pr[\mathcal{M}(D') \in O]$$
\end{definition}
\noindent $\Pr[\cdot]$ denotes the probability of $\mathcal{M}$ generating an output from $O \subseteq \mathcal{O}$ when given $D$ as input.
The \emph{privacy budget} $\varepsilon$ is a positive real number that represents the user-defined privacy goal~\cite{mcsherry2009privacy}.
As the definition implies, $\mathcal{M}$ achieves stronger privacy protection for lower values of $\varepsilon$, since the probabilities of $D$ and $D'$ being the true world are similar; however, the utility of the output is reduced, since more randomness is introduced by $\mathcal{M}$.
The privacy budget $\varepsilon$ is usually set to $0.01$, $0.1$, or, in some cases, $\ln2$ or $\ln3$~\cite{lee2011much}.
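For instance, setting $\varepsilon = \ln 2$ means that $e^\varepsilon = 2$; by Definition~\ref{def:dp}, observing any output set $O$ is then at most twice as likely under $D$ as under any neighboring $D'$, so an adversary's odds between the two neighboring worlds can change by at most a factor of $2$ after observing the output.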
The applicability of differential privacy mechanisms is inseparable from the sensitivity of the query function: the presence/absence of a single record should only change the result slightly, and therefore differential privacy methods are best suited for low-sensitivity queries (see Definition~\ref{def:qry-sens}), such as counts.
However, sum, max, and, in some cases, average queries can be problematic, since a single outlier value could change the output noticeably, making it necessary to add a lot of noise to the query's answer.
\begin{definition}
[Query function sensitivity~\cite{dwork2006calibrating}]
2021-07-18 22:59:07 +02:00
\label{def:qry-sens}
The sensitivity of a query function $f$ for all neighboring data sets $D, D' \in \mathcal{D}$ is:
$$\Delta f = \max_{D, D' \in \mathcal{D}} \lVert {f(D) - f(D')} \rVert_{1}$$
\end{definition}
The notion of differential privacy has highly influenced the research community, resulting in many follow-up publications (\cite{mcsherry2007mechanism, kifer2011no, zhang2017privbayes} to mention a few).
We distinguish here \emph{Pufferfish}~\cite{kifer2014pufferfish}, a framework that allows experts in an application domain, without necessarily having any particular expertise in privacy, to develop privacy definitions for their data sharing needs.
To define a privacy mechanism using \emph{Pufferfish}, one has to define a set of potential secrets $\mathcal{X}$, a set of distinct pairs $\mathcal{X}_{pairs}$, and auxiliary information about data evolution scenarios $\mathcal{B}$.
$\mathcal{X}$ serves as an explicit specification of what we would like to protect, e.g.,~`the record of an individual $x$ is (not) in the data'.
$\mathcal{X}_{pairs}$ is a subset of $\mathcal{X} \times \mathcal{X}$ that instructs how to protect the potential secrets $\mathcal{X}$, e.g.,~(`$x$ is in the table', `$x$ is not in the table').
Finally, $\mathcal{B}$ is a set of conservative assumptions about how the data evolved (or were generated) that reflects the adversary's belief about the data, e.g.,~probability distributions, variable correlation, etc.
When there is independence between all the records in the original data set, then $\varepsilon$-differential privacy and the privacy definition of $\varepsilon$-\emph{Pufferfish}$(\mathcal{X}, \mathcal{X}_{pairs}, \mathcal{B})$ are equivalent.
\paragraph{Popular privacy mechanisms}
\label{subsec:prv-mech}
A typical example of a differential privacy mechanism is the \emph{Laplace mechanism}~\cite{dwork2014algorithmic}.
It randomly draws a value from the probability distribution $\textrm{Laplace}(\mu, b)$, where $\mu$ stands for the location parameter and $b > 0$ for the scale parameter (Figure~\ref{fig:mech-lap}).
In our case, $\mu$ is equal to the original output value of a query function, and $b$ is the sensitivity of the query function divided by the privacy budget $\varepsilon$.
The Laplace mechanism works for any query function whose range is the set of real numbers.
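As a minimal sketch (assuming a numerical query, e.g.,~a count with sensitivity $1$), the mechanism can be implemented in a few lines:
\begin{verbatim}
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    # mu is the true query answer; b = sensitivity / epsilon is the scale.
    rng = rng or np.random.default_rng()
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# e.g., perturbing a true count of 5 with sensitivity 1 and epsilon = 1
noisy_count = laplace_mechanism(5, sensitivity=1, epsilon=1.0)
\end{verbatim}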
\begin{figure}[htp]
\centering
\includegraphics[width=.5\linewidth]{preliminaries/mech-lap}
\caption{A Laplace distribution for location $\mu = 0$ and different scale values $b$.}
\label{fig:mech-lap}
\end{figure}
A specialization of this mechanism for location data is the \emph{Planar Laplace mechanism}~\cite{andres2013geo,chatzikokolakis2015geo}, an adaptation of differential privacy for location data in snapshot publishing, also known as \emph{Geo-indistinguishability}.
It is based on $l$-privacy, which offers individuals within an area of radius $r$ a privacy level of $l$ (Figure~\ref{fig:mech-planar-lap}).
More specifically, $l$ is equal to $\varepsilon r$ if any two locations within distance $r$ provide data with similar distributions.
This similarity depends on $r$ because the closer two locations are, the more likely they are to share the same features.
Intuitively, the definition implies that if an adversary learns the published location for an individual, the adversary cannot infer the individual's true location, out of all the points in a radius $r$, with a certainty higher than a factor depending on $l$.
The technique adds random noise drawn from a multivariate Laplace distribution to individuals' locations, while taking into account spatial boundaries and features.
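A possible sketch of the sampling step, following the polar-coordinate construction of~\cite{andres2013geo}, is shown below; the coordinates are assumed to be planar (e.g.,~projected meters), and the Lambert~$W$ step uses SciPy:
\begin{verbatim}
import math
import numpy as np
from scipy.special import lambertw

def planar_laplace(x, y, epsilon, rng=None):
    # Draw a direction uniformly and a radius via the inverse CDF of the
    # planar Laplace distribution (branch -1 of the Lambert W function).
    rng = rng or np.random.default_rng()
    theta = rng.uniform(0.0, 2.0 * math.pi)
    p = rng.uniform(0.0, 1.0)
    r = -(1.0 / epsilon) * (lambertw((p - 1.0) / math.e, k=-1).real + 1.0)
    return x + r * math.cos(theta), y + r * math.sin(theta)

# e.g., reporting a location with epsilon = 0.01 (per meter)
reported = planar_laplace(650.0, 420.0, epsilon=0.01)
\end{verbatim}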
\begin{figure}[htp]
\centering
\includegraphics[width=.5\linewidth]{preliminaries/mech-planar-lap}
\caption{Geo-indistinguishability: privacy level $l$ varying with the protection radius $r$.}
\label{fig:mech-planar-lap}
\end{figure}
For query functions that do not return a real number, e.g.,~`What is the most visited country this year?', or in cases where perturbing the value of the output will completely destroy its utility, e.g.,~`How many patients in the ICU?', most works in the literature use the \emph{Exponential mechanism}~\cite{mcsherry2007mechanism}.
Initially, a utility function $u$, with sensitivity $\Delta u$, maps pairs of the input value $x$ and output value $r$ to utility scores.
Thereafter, the mechanism $\mathcal{M}$ selects an output value $r$ from a set of possible outputs $R$ with probability proportional to $\exp(\frac{\varepsilon u(x, r)}{2\Delta u})$.
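A minimal sketch of this selection step (illustrative; the candidate outputs and scores are hypothetical) could look as follows:
\begin{verbatim}
import numpy as np

def exponential_mechanism(x, outputs, utility, delta_u, epsilon, rng=None):
    rng = rng or np.random.default_rng()
    scores = np.array([utility(x, r) for r in outputs], dtype=float)
    # Shift by the maximum score for numerical stability; the shift cancels
    # out after normalization.
    weights = np.exp(epsilon * (scores - scores.max()) / (2.0 * delta_u))
    return rng.choice(outputs, p=weights / weights.sum())

# e.g., 'most visited country': utility = visit count, sensitivity 1
visits = {"France": 40, "Italy": 38, "Spain": 12}
answer = exponential_mechanism(visits, list(visits),
                               lambda d, country: d[country], 1, 1.0)
\end{verbatim}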
\begin{figure}[htp]
\centering
\includegraphics[width=.5\linewidth]{preliminaries/mech-exp}
\caption{The internal mechanics of the exponential mechanism.}
\label{fig:mech-exp}
\end{figure}
Another technique used in differential privacy mechanisms is \emph{randomized response}~\cite{warner1965randomized}.
It is a privacy-preserving survey method that introduces probabilistic noise to the statistics of a survey by randomly instructing respondents to answer truthfully or `Yes' to a sensitive, binary question.
The technique achieves this randomization by including a random event, e.g.,~the flip of a fair coin.
The respondents reveal to the interviewers only their answer to the question, and keep the result of the random event secret (i.e.,~whether the coin came up heads or tails).
Thereafter, the interviewers can calculate the probability distribution of the random event, e.g.,~$\frac{1}{2}$ heads and $\frac{1}{2}$ tails, and thus they can roughly eliminate the false responses and estimate the final result of the survey.
Based on this methodology, the \emph{random response} mechanism~\cite{wang2010privacy} returns the true answer value $x$ with a probability $p$ that depends on the privacy budget $\varepsilon$, typically $p = \frac{e^\varepsilon}{1 + e^\varepsilon}$, and the flipped value otherwise (Figure~\ref{fig:mech-rnd-resp}).
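A minimal sketch of this mechanism for a binary answer, assuming $p = \frac{e^\varepsilon}{1 + e^\varepsilon}$, follows:
\begin{verbatim}
import math
import random

def randomized_response(true_answer, epsilon):
    # Keep the true (binary) answer with probability p, flip it otherwise.
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return true_answer if random.random() < p else not true_answer

# The aggregator can debias the collected responses: if f is the observed
# fraction of 'Yes' answers, the estimated true fraction is
# (f - (1 - p)) / (2 * p - 1).
\end{verbatim}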
\begin{figure}[htp]
\centering
\includegraphics[width=.3\linewidth]{preliminaries/mech-rnd-resp}
\caption{The internal mechanics of the random response mechanism.}
\label{fig:mech-rnd-resp}
\end{figure}
A special category of differential privacy mechanisms is that of \emph{pan-private} algorithms~\cite{dwork2010pan}.
Pan-private algorithms maintain their privacy guarantees even when snapshots of their internal state (memory) are accessed during their execution by an external entity, e.g.,~due to a subpoena, a security breach, etc.
There are two intrusion types that a data publisher has to deal with when designing a pan-private mechanism: \emph{single unannounced} and \emph{continual announced} intrusion.
In the former, the data publisher assumes that the mechanism's state is observed by the external entity exactly once, without the data publisher ever being notified about it.
In the latter, the external entity gains access to the mechanism's state multiple times, and the publisher is notified after each time.
The simplest approach to deal with both cases is to make sure that the data in the memory of the mechanism constantly have the same distribution, i.e.,~that they are differentially private.
Notice that this must hold throughout the mechanism's lifetime, even before/after it processes any sensitive data item(s).
\bigskip
In what follows, we present some fundamental properties of differentially private mechanisms that govern their composition and post-processing.
\paragraph{Composition}
\label{subsec:compo}
Mechanisms that satisfy differential privacy are \emph{composable}, i.e.,~the combination of their results satisfies differential privacy as well.
In this section, we provide an overview of the most prominent composition theorems that instruct data publishers \emph{how} to estimate the overall privacy protection when utilizing a series of differential privacy mechanisms.
\begin{theorem}
[Composition~\cite{mcsherry2009privacy}]
\label{theor:compo}
Any combination of a set of independent differential privacy mechanisms satisfying a corresponding set of privacy guarantees shall satisfy differential privacy as well, i.e.,~provide a differentially private output.
\end{theorem}
Generally, when we apply a series of independent (i.e.,~with independently injected noise) differential privacy mechanisms on independent data, we can calculate the privacy level of the resulting output according to the \emph{sequential} composition property~\cite{mcsherry2009privacy, soria2016big}.
\begin{theorem}
[Sequential composition on independent data~\cite{mcsherry2009privacy}]
\label{theor:compo-seq-ind}
The privacy guarantee of $m \in \mathbb{Z}^+$ independent privacy mechanisms, satisfying $\varepsilon_1$-, $\varepsilon_2$-, \dots, $\varepsilon_m$-differential privacy respectively, when applied over the same data set equals $\sum_{i = 1}^m \varepsilon_i$.
\end{theorem}
Asking a series of queries may allow the disambiguation between possible data sets, making it necessary to add even more noise to the outputs.
In other words, keeping the original guarantee across multiple queries that require different/new answers requires the injection of noise proportional to the number of executed queries, thus destroying the utility of the output.
For this reason, after a series of queries exhausts the available privacy budget, the data set has to be discarded.
Notice that the sequential composition corresponds to the worst-case scenario in which, each time we use a mechanism, we have to invest some (or all) of the available privacy budget.
In the special case that we query disjoint data sets, we can take advantage of the \emph{parallel} composition property~\cite{mcsherry2009privacy, soria2016big}, and thus spare some of the available privacy budget.
\begin{theorem}
[Parallel composition on independent data~\cite{mcsherry2009privacy}]
\label{theor:compo-par-ind}
When $m \in \mathbb{Z}^+$ independent privacy mechanisms, satisfying $\varepsilon_1$-, $\varepsilon_2$-,\dots, $\varepsilon_m$-differential privacy respectively, are applied over disjoint independent subsets of a data set, they provide a privacy guarantee equal to $\max_{i \in [1, m]} \varepsilon_i$.
\end{theorem}
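As a toy illustration of the two accounting rules (the budget values are arbitrary), sequential composition adds the budgets up, whereas parallel composition is dominated by the largest one:
\begin{verbatim}
def sequential_budget(epsilons):
    # Independent mechanisms applied over the same data set.
    return sum(epsilons)

def parallel_budget(epsilons):
    # Independent mechanisms applied over disjoint subsets of a data set.
    return max(epsilons)

budgets = [0.1, 0.5, 0.4]
print(sequential_budget(budgets))  # 1.0
print(parallel_budget(budgets))    # 0.5
\end{verbatim}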
When the users consider recent data releases more privacy-sensitive than distant ones, we estimate the overall privacy loss in a time-fading manner, according to a temporal discounting function, e.g.,~exponential or hyperbolic~\cite{farokhi2020temporally}.
\begin{theorem}
[Sequential composition with temporal discounting~\cite{farokhi2020temporally}]
\label{theor:compo-seq-disc}
A set of $m \in \mathbb{Z}^+$ independent privacy mechanisms, satisfying $\varepsilon_1$-, $\varepsilon_2$-, \dots, $\varepsilon_m$-differential privacy respectively, satisfies $\sum_{i = 1}^m g(i) \varepsilon_i$-differential privacy for a discount function $g$.
\end{theorem}
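For illustration, assuming an exponential discount function $g(i) = \gamma^{\,m - i}$ that weighs recent releases more heavily than older ones (both the form of $g$ and the value of $\gamma$ are illustrative choices, not those of~\cite{farokhi2020temporally}), the accounting becomes:
\begin{verbatim}
def discounted_budget(epsilons, gamma=0.9):
    # g(i) = gamma ** (m - i): the most recent release (i = m) is not
    # discounted, while older releases contribute progressively less.
    m = len(epsilons)
    return sum(gamma ** (m - i) * eps
               for i, eps in enumerate(epsilons, start=1))

print(discounted_budget([0.1, 0.5, 0.4]))  # 0.1*0.81 + 0.5*0.9 + 0.4*1.0
\end{verbatim}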
When dealing with temporally correlated data, we handle a sequence of $w \leq t \in \mathbb{Z}^+$ mechanisms (indexed by $m \in [1, t]$) as a single entity where each mechanism contributes to the temporal privacy loss depending on its order of application~\cite{cao2017quantifying}.
The first ($m - 1$ if $w \leq 2$ or $m - w + 1$ if $w > 2$) and last ($m$) mechanisms contribute to the backward and forward temporal privacy loss respectively (see also Section~\ref{subsec:cor-temp}).
When $w$ is greater than $2$, the rest of the mechanisms (between $m - w + 2$ and $m - 1$) contribute only to the privacy loss that corresponds to the publication of the relevant data.
\begin{theorem}
[Sequential composition under temporal correlation~\cite{cao2018quantifying}]
\label{theor:compo-seq-cor}
When a set of $w \leq t \in \mathbb{Z}^+$ independent privacy mechanisms, satisfying $\varepsilon_{m \in [1, t]}$-differential privacy, is applied over a sequence of an equal number of temporally correlated data sets, it provides a privacy guarantee equal to:
$$
\begin{cases}
\alpha^B_{m - 1} + \alpha^F_m & \quad w \leq 2 \\
\alpha^B_{m - w + 1} + \alpha^F_m + \sum_{i = m - w + 2}^{m - 1} \varepsilon_i & \quad w > 2
\end{cases}
$$
\end{theorem}
Notice that the estimation of forward privacy loss is only pertinent to a setting under finite observation and moderate correlation.
In different circumstances, it might be impossible to calculate the upper bound of the temporal privacy loss, and thus only the backward privacy loss would be relevant.
\paragraph{Post-processing}
\label{subsec:p-proc}
Every time a data publisher interacts with (any part of) the original data set, it is mandatory to consume some of the available privacy budget according to the composition theorems~\ref{theor:compo-seq-ind} and~\ref{theor:compo-par-ind}.
However, the \emph{post-processing} of a perturbed data set can be done without using any additional privacy budget.
\begin{theorem}
[Post-processing~\cite{mcsherry2009privacy}]
\label{theor:p-proc}
The post-processing of any output of an $\varepsilon$-differential privacy mechanism shall not deteriorate its privacy guarantee.
\end{theorem}
Naturally, using the same (or different) privacy mechanism(s) multiple times to interact with raw data in combination with already perturbed data implies that the privacy guarantee of the final output will be calculated according to Theorem~\ref{theor:compo-seq-ind}.
That is, we add up the privacy budgets attributed to the outputs from previous mechanism applications with the current privacy budget.
\begin{example}
\label{ex:application}
To illustrate the usage of the microdata and statistical data techniques for privacy-preserving data publishing, we revisit Example~\ref{ex:continuous}.
In this example, users continuously interact with an LBS by reporting their status at various locations.
Then, the reported data are collected by the central service, in order to be protected and then published, either as a whole, or as statistics thereof.
Notice that in order to showcase the straightforward application of $k$-anonymity and differential privacy, we apply the two methods on each timestamp independently from the previous one, and do not take into account any additional threats imposed by continuity.
\includetable{preliminaries/scenario-micro}

First, we anonymize the data set of Figure~\ref{tab:continuous-micro} using $k$-anonymity, with $k = 3$.
This means that each user should be indistinguishable from at least $2$ other users.
Status is the sensitive attribute, thus the attribute that we wish to protect.
We start by suppressing the values of the Name attribute, which is the identifier.
The Age and Location attributes are the quasi-identifiers, so we proceed to adequately generalize them.
We turn age values to ranges ($\leq 20$, and $> 20$), and generalize location to city level (Paris).
Finally, we achieve $3$-anonymity by putting the entries in groups of three, according to the quasi-identifiers.
Figure~\ref{fig:scenario-micro} depicts the results at each timestamp.
\includetable{preliminaries/scenario-statistical}

Next, we demonstrate differential privacy.
We apply an $\varepsilon$-differentially private Laplace mechanism, with $\varepsilon = 1$, taking into account the count query that generated the true counts of Figure~\ref{tab:continuous-statistical}.
The sensitivity of a count query is $1$, since the addition/removal of a tuple from the data set can change the final result of the query by at most $1$.
Figure~\ref{fig:laplace} shows what the Laplace distribution for the true count in Montmartre at $t_1$ looks like.
Figure~\ref{tab:statistical-noisy} shows all the perturbed counts that are going to be released.
\end{example}
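The perturbation step of the example could be reproduced with a few lines of code; the counts below are placeholders, not the actual values of Figure~\ref{tab:continuous-statistical}:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng()
epsilon, sensitivity = 1.0, 1  # count query: sensitivity 1

true_counts = {("Montmartre", "t1"): 2, ("Opera", "t1"): 1}  # placeholders
noisy_counts = {key: count + rng.laplace(scale=sensitivity / epsilon)
                for key, count in true_counts.items()}
\end{verbatim}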