\chapter{Background}
\label{ch:background}
In this chapter, we introduce relevant terminology and background knowledge on the problem of continuous publication of private data sets.
\section{Data}
\label{sec:data}
\subsection{Categories}
\label{subsec:data-categories}
As this survey is about privacy, the data that we are interested in contain information about individuals and their actions; in their raw, usually tabular form, such data are also called \emph{microdata}.
When the data are aggregated, or transformed using a statistical analysis task, we talk about \emph{statistical data}.
Data in either of these two forms may have a special property called~\emph{continuity}, and data with continuity are further classified into the following three categories:
\begin{itemize}
\item \emph{Data stream} is a possibly \emph{unbounded} series of data.
\item \emph{Sequential data} is a series of \emph{dependent} data, e.g.,~trajectories.
\item \emph{Time series} is a series of data points \emph{indexed in time}.
\end{itemize}
Note that data may fall into more than one of these categories, e.g.,~we may have data streams of dependent data, or dependent data indexed in time.
\subsection{Processing}
\label{subsec:data-processing}
The traditional flow of crowdsourced data, as shown in Figure~\ref{fig:data-flow}, and witnessed in the majority of works that we cover, corresponds to a centralized architecture, i.e.,~data are collected, processed and anonymized, and published by an intermediate (trusted) entity~\cite{mcsherry2009privacy, blocki2013differentially, johnson2018towards}.
On the other hand, we are also aware of several solutions proposed to support a decentralized privacy-preserving scheme~\cite{andres2013geo, erlingsson2014rappor, katsomallos2017open}.
According to this approach, the anonymization process takes place locally, on the side of the data producers, before sending the data to any intermediate entity or, in some cases, directly to the data consumers.
In this way, the resulting anonymization procedure depends less on third-party entities, avoiding the risk of unpredicted privacy leakage from a compromised data publisher.
Contrary to centralized architectures, most such multi-party computation approaches fail to generate elaborate data statistics, to offer a complete and up-to-date view of the events happening on the data producers' side, and thus to maximize the utility of the available data.
For this reason, the centralized paradigm is more popular, and hence this survey also focuses on works in this area.
A data aggregation or privacy preservation process is performed under one of the following two prominent data processing paradigms:
\begin{itemize}
\item \emph{Batch} allows the --- usually offline --- processing of a (large) block of data, collected over a period of time, which could as well be a complete data set.
Batch processing excels at queries that require answers with high accuracy, since decisions are made based on observations on the whole data set.
\item \emph{Streaming} refers to the --- usually online --- processing of a possibly unbounded sequence of data items of small volume (in contrast to batch).
By nature, the processing is performed on the data subsets available at each point in time, and is ideal for time-critical queries that require (near) real-time analytics.
\end{itemize}
\subsection{Publishing}
\label{subsec:data-publishing}
Either raw or processed, data can subsequently be published using one of the following publication methods:
\begin{itemize}
\item \emph{One-shot} is the one-off publication of the (whole) data set at one time point.
\item \emph{Continuous} is the publishing of data sets in an uninterrupted manner.
The continuous publication scheme can be organized as shown below:
\begin{itemize}
\item \emph{Continual} refers to the publishing of data, characterized by some periodicity.
Slightly abusing the terminology, the terms `continuous' and `continual' often appear interchangeably in the literature.
\item \emph{Sequential} refers to the publishing of views of an original data set or updates thereof, one after the other.
By definition, the different releases are related to each other.
\begin{itemize}
\item \emph{Incremental} is the sequential publishing of the $\Delta$ (i.e.,~the difference) over the previous data set release.
\end{itemize}
\end{itemize}
\end{itemize}
\section{Privacy}
\label{sec:privacy}
When personal data are publicly released, either as microdata or statistical data, individuals' privacy can be compromised.
In the literature, this compromise is known as \emph{information disclosure} and is usually categorized into \emph{identity} and \emph{attribute} disclosure~\cite{li2007t}.
\begin{itemize}
\item \emph{Identity} disclosure occurs when an individual is linked to a particular record with a probability higher than a desired threshold.
\item \emph{Attribute} disclosure occurs when new information (an attribute value) about an individual is revealed.
\end{itemize}
Note that identity disclosure can result in attribute disclosure, and vice versa.
\subsection{Attacks}
\label{subsec:privacy-attacks}
Information disclosure is augmented by \emph{adversarial attacks}, i.e.,~by combining supplementary knowledge available to \emph{adversaries} with the released data, or by exploiting unrealistic assumptions made while designing the privacy-preserving algorithms.
Below we list example attacks that appear in the works that we review:
\begin{itemize}
\item Knowledge about the sensitive attribute domain.
Here we can identify \emph{homogeneity} and \emph{skewness} attacks~\cite{machanavajjhala2006diversity,li2007t}, based on knowledge of statistics on the sensitive attribute values, and the \emph{similarity} attack, based on semantic similarity between sensitive attribute values.
\item `Random' models of reasoning that make assumptions that are unrealistic in many scenarios, such as the random world model, the i.i.d.\ model, or the independent-tuples model.
These fall under the deFinetti attack~\cite{kifer2009attacks}.
\item External data sources, e.g.,~geographical, demographic or other supplementary information.
Such~\emph{background knowledge} enables the \emph{linkage attack}~\cite{narayanan2008robust}, which links individuals with certain records or attributes.
\item Previous releases of the same and/or related data sets, i.e.,~\emph{temporal} and \emph{complementary release} attacks~\cite{sweeney2002k}.
In this category, we can also identify the \emph{unsorted matching} attack~\cite{sweeney2002k}, which becomes possible when different releases keep the tuple ordering of the original data set.
\item \emph{Data correlations} derived from previous data releases and/or other external sources~\cite{kifer2011no, chen2014correlated, zhu2015correlated, liu2016dependence, zhao2017dependent}.
In the literature that we review, the most prominent types of data correlations are:
\begin{itemize}
\item \emph{Spatiotemporal}~\cite{gotz2012maskit, fan2013differentially, xiao2015protecting, cao2017quantifying, ma2017plp}, appearing when processing time series or sequences of human activities with geolocation characteristics.
\item \emph{Feature}~\cite{ghinita2009preventing}, uncovered among features of released data sets, and
\item \emph{Serial} or \emph{autocorrelations}~\cite{li2007hiding, fan2013differentially, erdogdu2015privacy, wang2016rescuedp, wang2017cts}, characterized by dependencies between the elements in one series of data.
\end{itemize}
\end{itemize}
The first three categories of attacks mentioned above have been addressed by several works in one-shot privacy-preserving data publishing.
As the continuous publishing scheme is more relevant and realistic nowadays, more recent works deal with the remaining categories of attacks, which take into account different releases.
\subsection{Levels}
\label{subsec:privacy-levels}
There are three levels of protection that the data publisher can consider: \emph{user-}, \emph{event-}, and \emph{$w$-event} privacy.
An \emph{event} is a (user, sensitive value) pair, e.g.,~the user $a$ is at location $l$.
\emph{User-}, and \emph{event-} privacy~\cite{dwork2010differential} are the main privacy levels; the former guarantees that all the events of any user \emph{for all timestamps} are protected, while the latter ensures that any single event \emph{at a specific timestamp} is protected.
Moreover, \emph{$w$-event} privacy~\cite{kellaris2014differentially} attempts to bridge the gap between event- and user-level privacy in streaming settings, by protecting any event sequence of any user within a window of $w$ timestamps.
$w$-event privacy is narrower than user-level privacy, since it does not hide multiple event sequences from the same user, but when $w$ is set to infinity, the $w$-event and user-level notions converge.
Note that the described levels have been coined in the context of \emph{differential privacy}~\cite{dwork2006calibrating}; nevertheless, they may apply to other privacy protection techniques as well.
\subsection{Seminal works}
\label{subsec:privacy-seminal}
Next, we visit some of the most important methods proposed in the literature for data privacy (not necessarily defined for continuous data set publication).
Following the categorization of anonymization techniques as defined in~\cite{wang2010privacy}, we visit the most prominent algorithms using the operations of \emph{aggregation}, \emph{suppression}, \emph{generalization}, data \emph{perturbation}, and \emph{randomization}.
To hide sensitive data by aggregation we group together multiple rows of a table to form a single value; by suppression we delete completely certain sensitive values or entire records~\cite{gruteser2004protecting}; and by generalization we replace an attribute value with a parent value in the attribute taxonomy.
In perturbation, we disturb the initial attribute value in a deterministic or probabilistic way.
When the distortion is done in a probabilistic way, we talk about randomization.
The first subsection, on microdata publication, uses the first four techniques, while the second subsection, on statistical data publication, uses the last two.
It is worth mentioning that there are a lot of contributions in the field of privacy preserving techniques based on \emph{cryptography}, e.g.,~\cite{benaloh2009patient, kamara2010cryptographic, cao2014privacy} that would be relevant to discuss.
However, the majority of these methods, among other assumptions that they make, place minimal or even no trust in the entities that handle personal information.
Furthermore, the amount and manner of data processing in these techniques usually burden the overall procedure, deteriorate the utility of the resulting data sets, and restrict their applicability.
Therefore, our focus is limited to techniques that achieve a satisfying balance between both participants' privacy and data utility.
For these reasons, there will be no further discussion around this family of techniques in this article.
\subsubsection{Microdata}
Sweeney coined \emph{$k$-anonymity}~\cite{sweeney2002k}, one of the first established works on data privacy.
A released data set features $k$-anonymity protection when the sequence of values for a set of identifying attributes, called the \emph{quasi-identifiers}, is the same for at least $k$ records in the data set.
This renders an individual indistinguishable from at least $k{-}1$ other individuals in the same data set.
In a follow-up work~\cite{sweeney2002achieving}, the author describes a way to achieve $k$-anonymity for a data set by the suppression or generalization of certain values of the quasi-identifiers.
However, $k$-anonymity is not immune to external re-identification attacks on the released data set.
The problematic settings identified in~\cite{sweeney2002k} appear when attempting to apply $k$-anonymity on continuous data publication (as we will also see in the next section).
These attacks include multiple $k$-anonymous data set releases with the same record order, subsequent releases of a data set without taking into account previous $k$-anonymous releases, and tuple changes over time.
Proposed solutions include rearranging the attributes, setting the whole attribute set of previously released data sets as quasi-identifiers or releasing data based on previous $k$-anonymous releases.
Machanavajjhala et al.~\cite{machanavajjhala2006diversity} pointed out that $k$-anonymity is vulnerable to homogeneity and background knowledge attacks.
Therefore, they proposed \emph{$l$-diversity}, which demands that each group contains at least $l$ `well-represented' values of the sensitive attribute.
Principally, a data set can be $l$-diverse by featuring at least $l$ distinct values for the sensitive field in each group (\emph{distinct} $l$-diversity).
Other instantiations demand that the entropy of the sensitive values in each group is greater than or equal to $\log(l)$ (\emph{entropy} $l$-diversity), or that the number of appearances of the most common sensitive value in a group is less than the sum of the counts of the $l$-th most common and all rarer values, multiplied by a user-defined constant $c$ (\emph{recursive} $(c, l)$-diversity).
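As a small worked example of these instantiations (the group below and the constant $c = 2$ are hypothetical, chosen only for illustration), consider a group of four records whose sensitive values are $u$, $u$, $v$, and $z$, i.e.,~with relative frequencies $p_u = \frac{1}{2}$ and $p_v = p_z = \frac{1}{4}$.
The group contains three distinct sensitive values, so it satisfies distinct $3$-diversity.
Its entropy, using base-$2$ logarithms on both sides of the comparison, is
\[
-\sum_{s \in \{u, v, z\}} p_s \log_2 p_s = \frac{1}{2}\cdot 1 + \frac{1}{4}\cdot 2 + \frac{1}{4}\cdot 2 = 1.5 ,
\]
which is at least $\log_2 2 = 1$ but below $\log_2 3 \approx 1.585$; the group is therefore entropy $2$-diverse but not entropy $3$-diverse.
For the recursive variant, let $r_1 \geq r_2 \geq \dots \geq r_m$ denote the counts of the sensitive values of a group in decreasing order; following~\cite{machanavajjhala2006diversity}, the condition reads
\[
r_1 < c \, (r_l + r_{l+1} + \dots + r_m) .
\]
Here $(r_1, r_2, r_3) = (2, 1, 1)$, so with $c = 2$ the group is recursive $(2, 2)$-diverse, since $2 < 2\cdot(1 + 1) = 4$, but not recursive $(2, 3)$-diverse, since $2 < 2\cdot 1 = 2$ does not hold.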
Later on, Li et al.~\cite{li2007t} indicated that $l$-diversity can be defeated by sensitive attributes with a small value range, as well as by skewness and similarity attacks.
To address such cases, \emph{$\theta$-closeness} guarantees that the distribution of a sensitive attribute in a group and the distribution of the same attribute in the whole data set are similar.
This similarity is bounded by a threshold $\theta$.
A data set is said to have $\theta$-closeness when all of its groups have $\theta$-closeness.
\subsubsection{Statistical data}
While methods based on $k$-anonymity have been mainly employed when releasing microdata, \emph{differential privacy}~\cite{dwork2006calibrating} has been proposed for `privately' releasing high-utility aggregates over microdata.
More precisely, differential privacy ensures that the removal or addition of a single data item, i.e.,~the record of an individual, in a released data set does not (substantially) affect the outcome of any analysis.
It is a statistical property of the privacy mechanism and is independent of the computational power and auxiliary information available to the adversary.
In its formal definition, a privacy mechanism $M$ that introduces some randomness to the query result provides $\varepsilon$-differential privacy, for a given privacy budget $\varepsilon$, if for all pairs of data sets $X$, $Y$ differing in one tuple, and for every set $S$ of possible outputs of $M$, the following holds:
\[
\Pr[M(X)\in S]\leq e^\varepsilon \Pr[M(Y)\in S]
\]
where $\Pr[\cdot]$ denotes the probability of an event and, as previously noted, $\varepsilon$ represents the user-defined privacy budget or, more precisely, the privacy risk.
As the definition implies, for lower budget ($\varepsilon$) values a mechanism achieves stronger privacy protection because the probabilities of $X$ and $Y$ being true worlds are similar.
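To give a sense of scale, a brief numeric reading of the definition may help (the budget values below are illustrative assumptions, not recommendations).
Because the condition must hold for every pair of data sets differing in one tuple, exchanging the roles of $X$ and $Y$ yields a two-sided bound:
\[
e^{-\varepsilon}\,\Pr[M(Y)\in S] \;\leq\; \Pr[M(X)\in S] \;\leq\; e^{\varepsilon}\,\Pr[M(Y)\in S] .
\]
For $\varepsilon = 0.5$ we have $e^{\varepsilon} \approx 1.65$, so the presence or absence of one tuple can change the probability of any set of outputs by a factor of at most about $1.65$.
For $\varepsilon = 0.1$ the factor drops to $e^{0.1} \approx 1.11$, and as $\varepsilon$ approaches $0$ the two probabilities become equal, i.e.,~the output reveals nothing about the single tuple.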
Differential privacy satisfies two composability properties: the sequential, and the parallel one~\cite{mcsherry2009privacy, soria2016big}.
Due to the \emph{sequential composability} property, the total privacy level of two independent mechanisms $M_1$ and $M_2$ over the same data set, which satisfy $\varepsilon_1$- and $\varepsilon_2$-differential privacy respectively, equals $\varepsilon_1 + \varepsilon_2$.
The \emph{parallel composability} property dictates that when the mechanisms $M_1$ and $M_2$ are applied over disjoint subsets of the same data set, the overall privacy level is $\max_{i\in\{1,2\}}\varepsilon_i$.
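As a numeric illustration of the two properties (the budget values are hypothetical), let $\varepsilon_1 = 0.3$ and $\varepsilon_2 = 0.2$:
\[
\mbox{sequential: } \varepsilon_1 + \varepsilon_2 = 0.5 ,
\qquad
\mbox{parallel: } \max\{\varepsilon_1, \varepsilon_2\} = 0.3 .
\]
In other words, querying the same records twice consumes the sum of the two budgets, whereas querying disjoint sets of records costs only as much as the more expensive of the two mechanisms.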
Methods based on differential privacy are best for low sensitivity queries such as counts, because the presence/absence of a single record can only change the result slightly.
However, sum and max queries can be problematic, since a single but very different value could change the answer noticeably, making it necessary to add a lot of noise to the query answer.
Furthermore, asking a series of queries may allow the disambiguation between possible worlds, making it necessary to add even more noise to the results.
For this reason, after a series of queries exhausts the privacy budget, the data set has to be discarded.
To keep the original guarantee across $n$ queries that require different/new answers, one must inject $n$ times the noise, thus destroying the utility of the output.
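To make this cost explicit, consider, as an assumption for illustration, a mechanism whose noise scale is $b = \Delta / \varepsilon'$, such as the Laplace mechanism used in Section~\ref{sec:scenario}, where $\Delta$ is the sensitivity of the query and $\varepsilon'$ is the budget spent on a single query.
Splitting a total budget $\varepsilon$ evenly across $n$ queries gives $\varepsilon' = \varepsilon / n$ by sequential composability, and hence
\[
b_n = \frac{\Delta}{\varepsilon / n} = n \, \frac{\Delta}{\varepsilon} .
\]
With the hypothetical values $\Delta = 1$, $\varepsilon = 0.5$, and $n = 10$, the scale grows from $b_1 = 2$ to $b_{10} = 20$, and the standard deviation of the Laplace noise, $\sqrt{2}\,b$, grows from about $2.83$ to about $28.3$.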
The notion of differential privacy has highly influenced the research community, resulting in many follow-up publications (\cite{mcsherry2007mechanism, kifer2011no, zhang2017privbayes}, to mention a few).
We distinguish here \emph{Pufferfish} proposed by Kifer et al.~\cite{kifer2014pufferfish}, an abstraction of differential privacy.
It is a framework that allows experts in an application domain, without necessarily having any particular expertise in privacy, to develop privacy definitions for their data sharing needs.
To define a privacy mechanism using \emph{Pufferfish}, one has to define a set of potential secrets \emph{S}, a set of discriminative pairs \emph{$S_{pairs}$}, and evolution scenarios \emph{D}.
\emph{S} serves as an explicit specification of what we would like to protect, e.g.,~`the record of an individual $X$ is (not) in the data'.
\emph{$S_{pairs}$} is a subset of $S\times S$ that instructs how to protect the potential secrets $S$, e.g.,~(`$X$ is in the table', `$X$ is not in the table').
Finally, \emph{D} is a set of conservative assumptions about how the data evolved (or were generated) that reflects the adversary's belief about the data, e.g.,~probability distributions, variable correlations, etc.
When there is independence between all the records in $D$, then $\varepsilon$-differential privacy and the privacy definition of $\varepsilon$-\emph{Pufferfish}$(S, S_{pairs}, D)$ are equivalent.
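For reference, the guarantee can be written out as follows; this is our paraphrase of the definition in~\cite{kifer2014pufferfish}, and the symbols $\omega$ (an output of the mechanism $M$), $\theta$ (a data distribution in $D$), and $\mathit{Data}$ (the data set, viewed as a random variable drawn from $\theta$) are notation introduced here.
A mechanism $M$ satisfies $\varepsilon$-\emph{Pufferfish}$(S, S_{pairs}, D)$ if, for every output $\omega$, every pair $(s_i, s_j) \in S_{pairs}$, and every $\theta \in D$ under which both $s_i$ and $s_j$ have nonzero probability,
\[
\Pr[M(\mathit{Data}) = \omega \mid s_i, \theta] \;\leq\; e^{\varepsilon}\, \Pr[M(\mathit{Data}) = \omega \mid s_j, \theta]
\]
and the same inequality holds with $s_i$ and $s_j$ exchanged.
Read this way, an adversary whose beliefs are captured by some $\theta \in D$ cannot use the output to tell the two secrets of a discriminative pair apart by more than a factor of $e^{\varepsilon}$.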
\section{Scenario}
\label{sec:scenario}
\begin{figure}[tbp]
\centering\noindent\adjustbox{max width=\linewidth} {
\begin{tabular}{@{}ccc@{}}
\begin{tabular}{@{}llll@{}}
\toprule
\textbf{\textit{Name}} & \textbf{Age} & \textbf{Location} & \textbf{Context} \\
\midrule
Donald & 27 & Paris & `at work' \\
Daisy & 25 & Paris & `driving' \\
Huey & 12 & New York & `running' \\
Dewey & 11 & New York & `at home' \\
Louie & 10 & New York & `walking' \\
Quackmore & 62 & Paris & `nearest restos?' \\
\bottomrule
\end{tabular}
&
\begin{tabular}{@{}llll@{}}
\toprule
\textbf{\textit{Name}} & \textbf{Age} & \textbf{Location} & \textbf{Context} \\
\midrule
Donald & 27 & Athens & `driving' \\
Daisy & 25 & Athens & `where's Acropolis?' \\
Huey & 12 & Paris & `what's here?' \\
Dewey & 11 & Paris & `walking' \\
Louie & 10 & Paris & `at home' \\
Quackmore & 62 & Athens & `swimming' \\
\bottomrule
\end{tabular}
&
\ldots
\\
$t_0$ & $ t_1$ & \ldots \\
\end{tabular}
}
\caption{Raw user-generated data in tabular form, for two timestamps.}
\label{fig:scenario}
\end{figure}
To illustrate the usage of the two main aforementioned techniques for privacy preserving data publishing, we provide here an example scenario.
Users interact with a location based service by making queries within some context at various locations.
Figure~\ref{fig:scenario} shows a toy data set of user-generated data at two consecutive timestamps, $t_0$ and $t_1$ (left and right tables, respectively).
Each table contains four attributes, namely \emph{Name} (the key of the relation), \emph{Age}, \emph{Location}, and \emph{Context}.
Location shows the spatial information (e.g.,~latitude, longitude or full address) related to the query.
Context includes information that characterizes the user's state, or the query itself.
Its content varies according to the service's functionality and is transmitted/received by the user.
\begin{figure}[tbp]
\centering\noindent\adjustbox{max width=\linewidth} {
\begin{tabular}{@{}ccc@{}}
\begin{tabular}{@{}llll@{}}
\toprule
\textbf{\textit{Name}} & \textbf{Age} & \textbf{Location} & \textbf{Context} \\
\midrule
* & $> 20$ & Paris & `at work' \\
* & $> 20$ & Paris & `driving' \\
* & $> 20$ & Paris & `nearest restos?' \\
\midrule
* & $\leq 20$ & New York & `running' \\
* & $\leq 20$ & New York & `at home' \\
* & $\leq 20$ & New York & `walking' \\
\bottomrule
\end{tabular}
&
\begin{tabular}{@{}llll@{}}
\toprule
\textbf{\textit{Name}} & \textbf{Age} & \textbf{Location} & \textbf{Context} \\
\midrule
* & $> 20$ & Athens & `driving' \\
* & $> 20$ & Athens & `where's Acropolis?' \\
* & $> 20$ & Athens & `swimming' \\
\midrule
* & $\leq 20$ & Paris & `what's here?' \\
* & $\leq 20$ & Paris & `walking' \\
* & $\leq 20$ & Paris & `at home' \\
\bottomrule
\end{tabular}
&
\ldots
\\
$t_0$ & $ t_1$ & \ldots \\
\end{tabular}
}
\caption{3-anonymous versions of the data in Figure~\ref{fig:scenario}.}
\label{fig:scenario-micro}
\end{figure}
First, we anonymize the data set of Figure~\ref{fig:scenario} using $k$-anonymity, with $k=3$.
This means that every user should be indistinguishable from at least 2 others.
We start by suppressing the values of the Name attribute, which is the identifier.
The Age and Location attributes are the quasi-identifiers, so we proceed to adequately generalize them.
The age values are turned to ranges ($\leq 20$, and $> 20$), but no generalization is needed for location.
Note however that in a bigger data set typically we would have to generalize all the quasi-identifiers to achieve the desired anonymization level.
Finally, $3$-anonymity is achieved by putting the entries in groups of three, according to the quasi-identifiers.
Figure~\ref{fig:scenario-micro} depicts the results at each timestamp.
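The result can be checked directly by counting the records that share each combination of quasi-identifier values (Age, Location) in Figure~\ref{fig:scenario-micro}:
\[
\begin{array}{lll}
t_0: & (> 20,\ \mbox{Paris}) \mapsto 3, & (\leq 20,\ \mbox{New York}) \mapsto 3 \\
t_1: & (> 20,\ \mbox{Athens}) \mapsto 3, & (\leq 20,\ \mbox{Paris}) \mapsto 3
\end{array}
\]
Every combination occurs at least $k = 3$ times, so each release, taken on its own, is $3$-anonymous.
If we additionally treat Context as the sensitive attribute (an assumption made here for illustration; the scenario does not designate one), each group holds three distinct Context values and is therefore also distinct $3$-diverse.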
\begin{figure}[tbp]
\centering\noindent\adjustbox{max width=\linewidth} {
\begin{tabular}{@{}ccc@{}}
\begin{tabular}{@{}llll@{}}
\toprule
& $t_0$ & $t_1$ & \ldots \\
\midrule
Paris & $3$ & $3$ & \ldots \\
New York & $3$ & $0$ & \ldots \\
Athens & $0$ & $3$ & \ldots \\
\bottomrule
\end{tabular}
& $\overrightarrow{Noise}$ &
\begin{tabular}{@{}llll@{}}
\toprule
& $t_0$ & $t_1$ & \ldots \\
\midrule
Paris & $2$ & $0$ & \ldots \\
New York & $3$ & $2$ & \ldots \\
Athens & $1$ & $2$ & \ldots \\
\bottomrule
\end{tabular}
\\
(a) True counts & & (b) Private counts \\
\end{tabular}
}
\caption{(a) Aggregated data from Figure~\ref{fig:scenario}, and (b) $0.5$-differentially private versions of these data.}
\label{fig:scenario-stat}
\end{figure}
Next, we demonstrate differential privacy.
Let us assume that we want to release the number of users at each location, for each timestamp.
For this reason, we run a count query $Q$, with a \emph{Group By} clause on Location, over each table of Figure~\ref{fig:scenario}.
Figure~\ref{fig:scenario-stat}~(a) shows the results of these queries, which are called \emph{true counts}.
Then, we apply an $\varepsilon$-differentially private mechanism, with $\varepsilon = 0.5$, taking into account $Q$, and the data sets.
This mechanism adds some noise to the true counts.
A typical example is the \emph{Laplace} mechanism, which randomly draws the value to release from a Laplace distribution with location $\mu$ and scale $b$.
In our case, $\mu$ is equal to the true count, and $b$ is the sensitivity of the count function divided by the available privacy budget $\varepsilon$.
In the case of the count function, the sensitivity is $1$, since the addition/removal of a tuple from the data set can change the final result of the function by at most $1$.
Figure~\ref{fig:laplace} shows what the Laplace distribution for the true count for Paris at $t_0$ looks like.
Figure~\ref{fig:scenario-stat}~(b) shows all the perturbed counts that are going to be released.
\begin{figure}[tbp]
\centering
\includegraphics[width=0.5\linewidth]{laplace}
\caption{A Laplace distribution for \emph{location} $\mu = 3$ and \emph{scale} $b = 2$.}
\label{fig:laplace}
\end{figure}
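For concreteness, the density plotted in Figure~\ref{fig:laplace} can be written out; the symbols $p$ (the density) and $z$ (a candidate released value) are notation introduced here.
With sensitivity $1$ and $\varepsilon = 0.5$, the scale is $b = 1/0.5 = 2$, and for the true count $\mu = 3$ we get
\[
p(z \mid \mu, b) = \frac{1}{2b} \exp\left(-\frac{|z - \mu|}{b}\right) = \frac{1}{4} \exp\left(-\frac{|z - 3|}{2}\right) .
\]
The standard deviation of this distribution is $\sqrt{2}\,b \approx 2.83$, which is comparable in size to the true counts of Figure~\ref{fig:scenario-stat}~(a) and explains why the private counts in Figure~\ref{fig:scenario-stat}~(b) can deviate visibly from them.
A short sketch of why a single noisy count satisfies $\varepsilon$-differential privacy: let $Q(X)$ and $Q(Y)$ be the true counts on two data sets differing in one tuple, so that $|Q(X) - Q(Y)| \leq 1$.
For any $z$, the ratio of the two output densities is
\[
\frac{p(z \mid Q(X), b)}{p(z \mid Q(Y), b)}
= \exp\left(\frac{|z - Q(Y)| - |z - Q(X)|}{b}\right)
\leq \exp\left(\frac{|Q(X) - Q(Y)|}{b}\right)
\leq \exp\left(\frac{1}{b}\right) = e^{\varepsilon} ,
\]
where the first inequality follows from the triangle inequality; integrating over any set of outputs gives the inequality in the definition of differential privacy.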
Note that in this example, the applied privacy preserving approaches are intentionally quite simplistic for demonstration purposes, without taking into account data continuity and the more advanced attacks (e.g.,~background knowledge attack, temporal inference attack, etc.) described in Section~\ref{subsec:privacy-attacks}.
Follow-up works have been developed to counter these attacks, as we review in the next section.