Added the survey

This commit is contained in:
Manos Katsomallos 2019-03-05 19:01:18 +01:00
parent 6c331a740f
commit 67e975abbe
15 changed files with 940 additions and 69 deletions


@ -1,2 +1,8 @@
\chapter{Abstract}
The abstract goes here.
Sensors, portable devices, and location-based services generate massive amounts of geo-tagged and/or location- and user-related data on a daily basis.
Such data are useful in numerous application domains from healthcare to intelligent buildings and from crowdsourced data collection to traffic monitoring.
Many of these data refer to the activities of individuals and carry information about them; thus, their manipulation and sharing inevitably raise concerns about the privacy of the individuals involved.
To address this problem, researchers have already proposed various seminal techniques for the protection of users' privacy.
However, the continual fashion in which data are generated nowadays, and the high availability of external sources of information, pose more threats and add extra challenges to the problem.
% In this thesis, we are concerned with and present the work done on data privacy in support of continuous data publication, and report on the proposed solutions, with a special focus on solutions concerning location or georeferenced data.


@ -1,8 +1,9 @@
\chapter{Acknowledgements}
Upon the completion of my thesis, I would like to express my deep gratitude to my research supervisors for their patient guidance, enthusiastic encouragement and useful critiques of this research work. Besides my advisors, I would like to thank the reporters as well as the rest of the jury for their invaluable contribution.
Upon the completion of my thesis, I would like to express my deep gratitude to my research supervisors for their patient guidance, enthusiastic encouragement and useful critiques of this research work.
Besides my advisors, I would like to thank the reporters as well as the rest of the jury for their invaluable contribution.
A special thanks to my department's faculty, staff and fellow students for their valuable assistance whenever needed and for creating a pleasant and creative environment during my studies.
Last but not least, I wish to thank my family and friends for their unconditional support and encouragement all these years.
Pontoise, September X, 2019
Cergy-Pontoise, MM DD, 2019

360
background.tex Normal file

@ -0,0 +1,360 @@
\chapter{Background}
\label{ch:background}
In this chapter, we introduce some relevant terminology and background knowledge around the problem of continuous publication of private data sets.
\section{Data}
\label{sec:data}
\subsection{Categories}
\label{subsec:data-categories}
As this survey is about privacy, the data that we are interested in contain information about individuals and their actions, also called \emph{microdata}, in their raw, usually tabular form.
When the data are aggregated, or transformed using a statistical analysis task, we talk about \emph{statistical data}.
Data in either of these two forms may have a special property called~\emph{continuity}, and data with continuity are further classified into the following three categories:
\begin{itemize}
\item \emph{Data stream} is a possibly \emph{unbounded} series of data.
\item \emph{Sequential data} is a series of \emph{dependent} data, e.g.,~trajectories.
\item \emph{Time series} is a series of data points \emph{indexed in time}.
\end{itemize}
Note that data may fall into more than one of these categories, e.g.,~we may have data streams of dependent data, or dependent data indexed in time.
\subsection{Processing}
\label{subsec:data-processing}
The traditional flow of crowdsourced data, as shown in Figure~\ref{fig:data-flow}, and witnessed in the majority of works that we cover, corresponds to a centralized architecture, i.e.,~data are collected, processed and anonymized, and published by an intermediate (trusted) entity~\cite{mcsherry2009privacy, blocki2013differentially, johnson2018towards}.
On the other hand, we are also aware of several solutions proposed to support a decentralized privacy-preserving scheme~\cite{andres2013geo, erlingsson2014rappor, katsomallos2017open}.
According to this approach, the anonymization process takes place locally, on the side of the data producers, before sending the data to any intermediate entity or, in some cases, directly to the data consumers.
In this way, the resulting anonymization procedure is more independent from third-party entities, avoiding the risk of unpredicted privacy leakage from a compromised data publisher.
Contrary to centralized architectures, most such multi-party computation approaches fail to generate elaborate data statistics and to offer a complete and up-to-date view of the events happening on the data producers' side, and thus fail to maximize the utility of the available data.
For this reason, the centralized paradigm is more popular and hence, in this survey we also focus on works in this area.
A data aggregation or a privacy preservation process is performed in the following two prominent data processing paradigms:
\begin{itemize}
\item \emph{Batch} allows the --- usually offline --- processing of a (large) block of data, collected over a period of time, which could as well be a complete data set.
Batch processing excels at queries that require answers with high accuracy, since decisions are made based on observations on the whole data set.
\item \emph{Streaming} refers to the --- usually online --- processing of a possibly unbounded sequence of data items of small volume (in contrast to batch).
By nature, the processing is performed on the data subsets available at each point in time, and is ideal for time-critical queries that require (near) real-time analytics.
\end{itemize}
\subsection{Publishing}
\label{subsec:data-publishing}
Either raw or processed, data can subsequently be published using one of the following publication methods:
\begin{itemize}
\item \emph{One-shot} is the one-off publication of the (whole) data set at one time point.
\item \emph{Continuous} is the publishing of data sets in an uninterrupted manner.
The continuous publication scheme can be organized as shown below:
\begin{itemize}
\item \emph{Continual} refers to the publishing of data, characterized by some periodicity.
Slightly abusing the terminology, the terms `continuous' and `continual' often appear interchangeably in the literature.
\item \emph{Sequential} refers to the publishing of views of an original data set or updates thereof, one after the other.
By definition, the different releases are related to each other.
\begin{itemize}
\item \emph{Incremental} is the sequential publishing of the $\Delta$ (i.e.,~the difference) over the previous data set release.
\end{itemize}
\end{itemize}
\end{itemize}
\section{Privacy}
\label{sec:privacy}
When personal data are publicly released, either as microdata or statistical data, individuals' privacy can be compromised.
In the literature, this compromise is known as \emph{information disclosure} and is usually categorized into \emph{identity} and \emph{attribute} disclosure~\cite{li2007t}.
\begin{itemize}
\item \emph{Identity}, an individual is linked to a particular record, with a probability higher than a desired threshold.
\item \emph{Attribute}, new information (attribute value) about an individual is revealed.
\end{itemize}
Note that identity disclosure can result in attribute disclosure, and vice versa.
\subsection{Attacks}
\label{subsec:privacy-attacks}
Information disclosure is facilitated by \emph{adversarial attacks}, i.e.,~by combining supplementary knowledge available to \emph{adversaries} with the released data, or by exploiting unrealistic assumptions made while designing the privacy-preserving algorithms.
Below we list example attacks that appear in the works that we review:
\begin{itemize}
\item Knowledge about the sensitive attribute domain.
Here we can identify \emph{homogeneity and skewness} attacks~\cite{machanavajjhala2006diversity,li2007t}, based on knowledge of statistics on the sensitive attribute values, and \emph{similarity attack} based on semantic similarity between sensitive attribute values.
\item `Random' models of reasoning that make unrealistic assumptions in many scenarios, such as the random world model, the i.i.d.\ model, or the independent-tuples model.
These fall under the deFinetti attack~\cite{kifer2009attacks}.
\item External data sources, e.g.,~geographical, demographic or other supplementary information.
Such~\emph{background knowledge} enables the \emph{linkage attack}~\cite{narayanan2008robust}, which helps link individuals with certain records or attributes.
\item Previous releases of the same and/or related data sets, i.e.,~\emph{temporal} and \emph{complementary release} attacks~\cite{sweeney2002k}.
In this category, we can also identify the \emph{unsorted matching} attack~\cite{sweeney2002k}, which is achieved when the original data set is considered in the same tuple ordering for different releases.
\item \emph{Data correlations} derived from previous data releases and/or other external sources~\cite{kifer2011no, chen2014correlated, zhu2015correlated, liu2016dependence, zhao2017dependent}.
In the literature that we review, the most prominent types of data correlations are:
\begin{itemize}
\item \emph{Spatiotemporal}~\cite{gotz2012maskit, fan2013differentially, xiao2015protecting, cao2017quantifying, ma2017plp}, appearing when processing time series or sequences of human activities with geolocation characteristics.
\item \emph{Feature}~\cite{ghinita2009preventing}, uncovered among features of released data sets, and
\item \emph{Serial} or \emph{autocorrelations}~\cite{li2007hiding, fan2013differentially, erdogdu2015privacy, wang2016rescuedp, wang2017cts}, characterized by dependencies between the elements in one series of data.
\end{itemize}
\end{itemize}
The first three categories of attacks mentioned above have been addressed by several works in one-shot privacy-preserving data publishing.
As the continuous publishing scheme is more relevant and realistic nowadays, more recent works deal with the latter types of attacks, which take into account different releases.
\subsection{Levels}
\label{subsec:privacy-levels}
There are three levels of protection that the data publisher can consider: \emph{user-}, \emph{event-}, and \emph{$w$-event} privacy.
An \emph{event} is a (user, sensitive value) pair, e.g.,~the user $a$ is at location $l$.
\emph{User-}, and \emph{event-} privacy~\cite{dwork2010differential} are the main privacy levels; the former guarantees that all the events of any user \emph{for all timestamps} are protected, while the latter ensures that any single event \emph{at a specific timestamp} is protected.
Moreover, \emph{$w$-event}~\cite{kellaris2014differentially} attempts to bridge the gap between event and user level privacy in streaming settings, by protecting any event sequence of any user within a window of $w$ timestamps.
$w$-event is narrower than user level privacy, since it does not hide multiple event sequences from the same user, but when $w$ is set to infinity, the $w$-event and user level notions converge.
Note that the described levels have been coined in the context of \emph{differential privacy}~\cite{dwork2006calibrating}; nevertheless, they may apply to other privacy protection techniques as well.
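As a sketch of what the $w$-event level implies for budget allocation, assume a mechanism that spends a budget $\varepsilon_k$ at each timestamp $k$, where $\varepsilon$ denotes the total privacy budget of differential privacy (the per-timestamp notation is introduced here only for illustration).
The condition given in~\cite{kellaris2014differentially} can then be summarized as requiring that the budgets spent within any window of $w$ consecutive timestamps sum to at most $\varepsilon$:
\[
  \sum_{k = i - w + 1}^{i} \varepsilon_k \leq \varepsilon \quad \mbox{for every timestamp } i.
\]
For instance, with $w = 3$ and $\varepsilon = 0.6$, a uniform allocation spends $\varepsilon_k = 0.2$ at each timestamp.
Setting $w = 1$ recovers the event level, since each timestamp may then use the whole budget.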
\subsection{Seminal works}
\label{subsec:privacy-seminal}
Next, we visit some of the most important methods proposed in the literature for data privacy (not necessarily defined for continuous data set publication).
Following the categorization of anonymization techniques as defined in~\cite{wang2010privacy}, we visit the most prominent algorithms using the operations of \emph{aggregation}, \emph{suppression}, \emph{generalization}, data \emph{perturbation}, and \emph{randomization}.
To hide sensitive data by aggregation we group together multiple rows of a table to form a single value; by suppression we delete completely certain sensitive values or entire records~\cite{gruteser2004protecting}; and by generalization we replace an attribute value with a parent value in the attribute taxonomy.
In perturbation, we disturb the initial attribute value in a deterministic or probabilistic way.
When the distortion is done in a probabilistic way, we talk about randomization.
The first subsection, on microdata publication, uses the first four operations, while the second subsection, on statistical data publication, uses the last two.
It is worth mentioning that there are a lot of contributions in the field of privacy preserving techniques based on \emph{cryptography}, e.g.,~\cite{benaloh2009patient, kamara2010cryptographic, cao2014privacy} that would be relevant to discuss.
However, the majority of these methods, among other assumptions that they make, place minimal or even no trust in the entities that handle personal information.
Furthermore, the amount and way of data processing of these techniques usually burden the overall procedure, deteriorate the utility of the resulting data sets, and restrict their applicability.
Therefore, our focus is limited to techniques that achieve a satisfying balance between participants' privacy and data utility.
For these reasons, there will be no further discussion around this family of techniques in this article.
\subsubsection{Microdata}
Sweeney coined \emph{$k$-anonymity}~\cite{sweeney2002k}, one of the first established works on data privacy.
A released data set features $k$-anonymity protection when the sequence of values for a set of identifying attributes, called the \emph{quasi-identifiers}, is the same for at least $k$ records in the data set.
This renders an individual indistinguishable from at least $k{-}1$ other individuals in the same data set.
In a follow-up work~\cite{sweeney2002achieving}, the author describes a way to achieve $k$-anonymity for a data set by the suppression or generalization of certain values of the quasi-identifiers.
However, $k$-anonymity is not immune to external re-identification attacks on the released data set.
The problematic settings identified in~\cite{sweeney2002k} appear when attempting to apply $k$-anonymity on continuous data publication (as we will also see in the next section).
These attacks exploit multiple $k$-anonymous data set releases with the same record order, subsequent releases of a data set that do not take into account previous $k$-anonymous releases, and tuple changes over time.
Proposed solutions include rearranging the attributes, setting the whole attribute set of previously released data sets as quasi-identifiers or releasing data based on previous $k$-anonymous releases.
Machanavajjhala et al.~\cite{machanavajjhala2006diversity} pointed out that $k$-anonymity is vulnerable to homogeneity and background knowledge attacks.
Therefore, they proposed \emph{$l$-diversity}, which demands that the values of the sensitive attributes are `well-represented' by $l$ sensitive values in each group.
In its simplest form, a data set can be $l$-diverse by featuring at least $l$ distinct values for the sensitive field in each group (\emph{distinct} $l$-diversity).
Other instantiations demand that the entropy of the sensitive values in each group is greater than or equal to $\log(l)$ (\emph{entropy} $l$-diversity), or that the number of appearances of the most common sensitive value in a group is less than a user-defined constant $c$ multiplied by the sum of the counts of the $l$-th most common value and of all less common ones (\emph{recursive $(c, l)$}-diversity).
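Stated more formally, and using notation that we introduce here only for illustration, let $p(G, s)$ be the fraction of records in a group $G$ that have the sensitive value $s$, and let $r_1 \geq r_2 \geq \dots \geq r_m$ be the counts of the $m$ sensitive values of $G$ in descending order.
Following~\cite{machanavajjhala2006diversity}, the entropy and recursive conditions can be sketched, for every group $G$, as
\[
  -\sum_{s} p(G, s) \log p(G, s) \geq \log(l)
  \qquad \mbox{and} \qquad
  r_1 < c \, (r_l + r_{l+1} + \dots + r_m).
\]
For example, a group whose sensitive values have counts $(3, 2, 1)$ has an entropy of approximately $1.011$ (using natural logarithms), which lies between $\log(2) \approx 0.693$ and $\log(3) \approx 1.099$; such a group satisfies entropy $2$-diversity but not entropy $3$-diversity.
The same group satisfies recursive $(2, 2)$-diversity, since $3 < 2 \cdot (2 + 1)$.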
Later on, Li et al.~\cite{li2007t} indicated that $l$-diversity can be voided by sensitive attributes with a small value range, as well as by skewness and similarity attacks.
In such cases, \emph{$\theta$-closeness} guarantees that the distribution of a sensitive attribute in a group and the distribution of the same attribute in the whole data set are similar.
This similarity is bounded by a threshold $\theta$.
A data set is said to have $\theta$-closeness when all of its groups have $\theta$-closeness.
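In symbols, and assuming the notation $P_G$ for the distribution of the sensitive attribute in a group $G$ and $P$ for its distribution in the whole data set (both introduced here for illustration), the requirement reads
\[
  D(P_G, P) \leq \theta \quad \mbox{for every group } G,
\]
where $D$ is a distance between distributions; the Earth Mover's Distance is the choice made in~\cite{li2007t}.
As a simple illustration, consider a categorical attribute with two values that are considered equally far apart, in which case this distance reduces to half the sum of the absolute differences of the probabilities.
If the whole data set is split evenly, $P = (0.5, 0.5)$, while a group contains only the first value, $P_G = (1, 0)$, then $D(P_G, P) = \frac{1}{2}(0.5 + 0.5) = 0.5$, so this group violates $\theta$-closeness for any $\theta < 0.5$.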
\subsubsection{Statistical data}
While methods based on $k$-anonymity have been mainly employed when releasing microdata, \emph{differential privacy}~\cite{dwork2006calibrating} has been proposed for `privately' releasing high utility aggregates over microdata.
More precisely, differential privacy ensures that the removal or addition of a single data item, i.e.,~the record of an individual, in a released data set does not (substantially) affect the outcome of any analysis.
It is a statistical property of the privacy mechanism and is independent of the computational power and auxiliary information available to the adversary.
In its formal definition, a privacy mechanism $M$ that introduces some randomness to the query result, provides $\varepsilon$-differential privacy, for a given privacy budget $\varepsilon$, if for all pairs of data sets $X$, $Y$ differing in one tuple, the following holds:
\[
  \Pr[M(X)\in S]\leq e^\varepsilon \Pr[M(Y)\in S]
\]
where $\Pr[\cdot]$ denotes the probability of an event, $S$ denotes any set of possible outputs of $M$ (i.e.,~of possible worlds), and, as previously noted, $\varepsilon$ represents the user-defined privacy budget or, more precisely, the privacy risk.
As the definition implies, for lower budget ($\varepsilon$) values a mechanism achieves stronger privacy protection because the probabilities of $X$ and $Y$ being true worlds are similar.
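To give a sense of scale, the multiplicative bound $e^\varepsilon$ can be evaluated for a few budgets (the values below are chosen only for illustration):
\[
  e^{0.1} \approx 1.105, \qquad e^{0.5} \approx 1.649, \qquad e^{1} \approx 2.718.
\]
Hence, with $\varepsilon = 0.5$ the probability of observing any given set of outputs may differ by a factor of at most about $1.65$ between two data sets that differ in one tuple, whereas $\varepsilon = 0.1$ tightens this factor to about $1.11$.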
Differential privacy satisfies two composability properties: the sequential, and the parallel one~\cite{mcsherry2009privacy, soria2016big}.
Due to the \emph{sequential composability} property, the total privacy level of two independent mechanisms $M_1$ and $M_2$ over the same data set, which satisfy $\varepsilon_1$- and $\varepsilon_2$-differential privacy respectively, equals $\varepsilon_1 + \varepsilon_2$.
The \emph{parallel composability} property dictates that when the mechanisms $M_1$ and $M_2$ are applied over disjoint subsets of the same data set, the overall privacy level is $\max_{i\in\{1,2\}} \varepsilon_i$.
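As a small worked example, with budgets chosen only for illustration, let $\varepsilon_1 = 0.3$ and $\varepsilon_2 = 0.2$.
Writing $\varepsilon_{seq}$ and $\varepsilon_{par}$ for the overall privacy level under sequential and parallel composition respectively (names that we introduce here), we obtain
\[
  \varepsilon_{seq} = \varepsilon_1 + \varepsilon_2 = 0.5,
  \qquad
  \varepsilon_{par} = \max(\varepsilon_1, \varepsilon_2) = 0.3.
\]
In other words, querying disjoint subsets costs only as much as the most expensive mechanism, whereas querying the same data repeatedly accumulates the cost.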
Methods based on differential privacy are best for low sensitivity queries such as counts, because the presence/absence of a single record can only change the result slightly.
However, sum and max queries can be problematic, since a single but very different value could change the answer noticeably, making it necessary to add a lot of noise to the query answer.
Furthermore, asking a series of queries may allow the disambiguation between possible worlds, making it necessary to add even more noise to the results.
For this reason, after a series of queries exhausts the privacy budget, the data set has to be discarded.
To keep the original guarantee across $n$ queries that require different/new answers, one must inject $n$ times the noise, thus destroying the utility of the output.
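This trade-off can be sketched for a mechanism that adds Laplace noise with a scale equal to the query sensitivity divided by the budget, under the assumptions (ours, for illustration) that the budget $\varepsilon$ is split evenly across the $n$ queries and that each query has sensitivity $\Delta$.
Each query then receives a budget of $\varepsilon / n$, so the scale $b_n$ of its noise becomes
\[
  b_n = \frac{\Delta}{\varepsilon / n} = n \, \frac{\Delta}{\varepsilon}.
\]
For example, with $\Delta = 1$ and $\varepsilon = 0.5$, a single query uses $b_1 = 2$, while each of $n = 10$ queries uses $b_{10} = 20$.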
The notion of differential privacy has highly influenced the research community, resulting in many follow-up publications (\cite{mcsherry2007mechanism, kifer2011no, zhang2017privbayes}, to mention a few).
We distinguish here \emph{Pufferfish} proposed by Kifer et al.~\cite{kifer2014pufferfish}, an abstraction of differential privacy.
It is a framework that allows experts in an application domain, without necessarily having any particular expertise in privacy, to develop privacy definitions for their data sharing needs.
To define a privacy mechanism using \emph{Pufferfish}, one has to define a set of potential secrets \emph{S}, a set of discriminative pairs \emph{$S_{pairs}$}, and evolution scenarios \emph{D}.
\emph{S} serves as an explicit specification of what we would like to protect, e.g.,~`the record of an individual $X$ is (not) in the data'.
\emph{$S_{pairs}$} is a subset of $S\times S$ that instructs how to protect the potential secrets $S$, e.g.,~(`$X$ is in the table', `$X$ is not in the table').
Finally, \emph{D} is a set of conservative assumptions about how the data evolved (or were generated) that reflects the adversary's belief about the data, e.g.,~probability distributions, variable correlations, etc.
When there is independence between all the records in $D$, then $\varepsilon$-differential privacy and the privacy definition of $\varepsilon$-\emph{Pufferfish}$(S, S_{pairs}, D)$ are equivalent.
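For reference, the guarantee can be sketched as follows, paraphrasing~\cite{kifer2014pufferfish} and writing $d$ for a distribution in $D$, $\omega$ for a possible output, and $Data$ for the random data set (symbols that we introduce here for illustration).
A mechanism $M$ satisfies $\varepsilon$-\emph{Pufferfish}$(S, S_{pairs}, D)$ if, for every output $\omega$, every pair $(s_i, s_j) \in S_{pairs}$, and every $d \in D$ under which both $s_i$ and $s_j$ have non-zero probability,
\[
  \Pr[M(Data) = \omega \mid s_i, d] \leq e^\varepsilon \Pr[M(Data) = \omega \mid s_j, d],
\]
and the same inequality holds with $s_i$ and $s_j$ swapped.
Intuitively, observing the output should not help an adversary whose beliefs are described by $d$ to tell apart the two secrets of any discriminative pair.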
\section{Scenario}
\label{sec:scenario}
\begin{figure}[tbp]
\centering\noindent\adjustbox{max width=\linewidth} {
\begin{tabular}{@{}ccc@{}}
\begin{tabular}{@{}llll@{}}
\toprule
\textbf{\textit{Name}} & \textbf{Age} & \textbf{Location} & \textbf{Context} \\
\midrule
Donald & 27 & Paris & `at work' \\
Daisy & 25 & Paris & `driving' \\
Huey & 12 & New York & `running' \\
Dewey & 11 & New York & `at home' \\
Louie & 10 & New York & `walking' \\
Quackmore & 62 & Paris & `nearest restos?' \\
\bottomrule
\end{tabular}
&
\begin{tabular}{@{}llll@{}}
\toprule
\textbf{\textit{Name}} & \textbf{Age} & \textbf{Location} & \textbf{Context} \\
\midrule
Donald & 27 & Athens & `driving' \\
Daisy & 25 & Athens & `where's Acropolis?' \\
Huey & 12 & Paris & `what's here?' \\
Dewey & 11 & Paris & `walking' \\
Louie & 10 & Paris & `at home' \\
Quackmore & 62 & Athens & `swimming' \\
\bottomrule
\end{tabular}
&
\ldots
\\
$t_0$ & $ t_1$ & \ldots \\
\end{tabular}
}
\caption{Raw user-generated data in tabular form, for two timestamps.}
\label{fig:scenario}
\end{figure}
To illustrate the usage of the two main aforementioned techniques for privacy preserving data publishing, we provide here an example scenario.
Users interact with a location based service by making queries within some context at various locations.
Figure~\ref{fig:scenario} shows a toy data set of user-generated data at two consecutive timestamps, $t_0$ and $t_1$ (Figure~\ref{fig:scenario}~(a) \&~(b) respectively).
Each table contains four attributes, namely \emph{Name} (the key of the relation), \emph{Age}, \emph{Location}, and \emph{Context}.
Location shows the spatial information (e.g.,~latitude, longitude or full address) related to the query.
Context includes information that characterizes the user's state, or the query itself.
Its content varies according to the service's functionality and is transmitted/received by the user.
\begin{figure}[tbp]
\centering\noindent\adjustbox{max width=\linewidth} {
\begin{tabular}{@{}ccc@{}}
\begin{tabular}{@{}llll@{}}
\toprule
\textbf{\textit{Name}} & \textbf{Age} & \textbf{Location} & \textbf{Context} \\
\midrule
* & $> 20$ & Paris & `at work' \\
* & $> 20$ & Paris & `driving' \\
* & $> 20$ & Paris & `nearest restos?' \\
\midrule
* & $\leq 20$ & New York & `running' \\
* & $\leq 20$ & New York & `at home' \\
* & $\leq 20$ & New York & `walking' \\
\bottomrule
\end{tabular}
&
\begin{tabular}{@{}llll@{}}
\toprule
\textbf{\textit{Name}} & \textbf{Age} & \textbf{Location} & \textbf{Context} \\
\midrule
* & $> 20$ & Athens & `driving' \\
* & $> 20$ & Athens & `where's Acropolis?' \\
* & $> 20$ & Athens & `swimming' \\
\midrule
* & $\leq 20$ & Paris & `what's here?' \\
* & $\leq 20$ & Paris & `walking' \\
* & $\leq 20$ & Paris & `at home' \\
\bottomrule
\end{tabular}
&
\ldots
\\
$t_0$ & $ t_1$ & \ldots \\
\end{tabular}
}
\caption{3-anonymous versions of the data in Figure~\ref{fig:scenario}.}
\label{fig:scenario-micro}
\end{figure}
First, we anonymize the data set of Figure~\ref{fig:scenario} using $k$-anonymity, with $k=3$.
This means that no user should be distinguishable from at least $2$ others.
We start by suppressing the values of the Name attribute, which is the identifier.
The Age and Location attributes are the quasi-identifiers, so we proceed to adequately generalize them.
The age values are turned into ranges ($\leq 20$, and $> 20$), but no generalization is needed for location.
Note however that in a bigger data set typically we would have to generalize all the quasi-identifiers to achieve the desired anonymization level.
Finally, $3$-anonymity is achieved by putting the entries in groups of three, according to the quasi-identifiers.
Figure~\ref{fig:scenario-micro} depicts the results at each timestamp.
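The result can be checked directly.
Let $n(q)$ denote the number of records that share the quasi-identifier values $q$ in a release (a notation that we introduce here for illustration).
From Figure~\ref{fig:scenario-micro} we read, at $t_0$ and $t_1$ respectively,
\[
  \min\{\, n(> 20, \mbox{Paris}),\; n(\leq 20, \mbox{New York}) \,\} = \min\{3, 3\} = 3 \geq k,
\]
\[
  \min\{\, n(> 20, \mbox{Athens}),\; n(\leq 20, \mbox{Paris}) \,\} = \min\{3, 3\} = 3 \geq k,
\]
so every combination of quasi-identifier values appears in at least $k = 3$ records of each release.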
\begin{figure}[tbp]
\centering\noindent\adjustbox{max width=\linewidth} {
\begin{tabular}{@{}ccc@{}}
\begin{tabular}{@{}llll@{}}
\toprule
& $t_0$ & $t_1$ & \ldots \\
\midrule
Paris & $3$ & $3$ & \ldots \\
New York & $3$ & $0$ & \ldots \\
Athens & $0$ & $3$ & \ldots \\
\bottomrule
\end{tabular}
& $\overrightarrow{Noise}$ &
\begin{tabular}{@{}llll@{}}
\toprule
& $t_0$ & $t_1$ & \ldots \\
\midrule
Paris & $2$ & $0$ & \ldots \\
New York & $3$ & $2$ & \ldots \\
Athens & $1$ & $2$ & \ldots \\
\bottomrule
\end{tabular}
\\
(a) True counts & & (b) Private counts \\
\end{tabular}
}
\caption{(a) Aggregated data from Figure~\ref{fig:scenario}, and (b) $0.5$-differentially private versions of these data.}
\label{fig:scenario-stat}
\end{figure}
Next, we demonstrate differential privacy.
Let us assume that we want to release the number of users at each location, for each timestamp.
For this reason, we run a count query $Q$, with a \emph{Group By} clause on Location, over each table of Figure~\ref{fig:scenario}.
Figure~\ref{fig:scenario-stat}~(a) shows the results of these queries, which are called \emph{true counts}.
Then, we apply an $\varepsilon$-differentially private mechanism, with $\varepsilon = 0.5$, taking into account $Q$, and the data sets.
This mechanism adds some noise to the true counts.
A typical example is the \emph{Laplace} mechanism, which randomly draws a (noisy) value from a Laplace distribution with location $\mu$ and scale $b$.
In our case, $\mu$ is equal to the true count, and $b$ is the sensitivity of the count function divided by the available privacy budget $\varepsilon$.
In the case of the count function, the sensitivity is $1$, since the addition/removal of a tuple from the data set can change the final result of the function by at most $1$.
Figure~\ref{fig:laplace} shows what the Laplace distribution for the true count for Paris at $t_0$ looks like.
Figure~\ref{fig:scenario-stat}~(b) shows all the perturbed counts that are going to be released.
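For concreteness, the parameters of this example can be worked out as follows, writing $\Delta$ for the sensitivity of the count query (a symbol that we introduce here for illustration).
The scale of the noise is
\[
  b = \frac{\Delta}{\varepsilon} = \frac{1}{0.5} = 2,
\]
and the density of the Laplace distribution with location $\mu$ and scale $b$ is
\[
  f(x \mid \mu, b) = \frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right),
\]
so that for Paris at $t_0$, where $\mu = 3$, we obtain $f(x \mid 3, 2) = \frac{1}{4} \exp\left(-|x - 3| / 2\right)$.
As a sanity check, suppose that one user were removed from Paris, so that the true count became $2$.
Since $|x - 2| - |x - 3| \leq 1$ for every $x$, the ratio of the two densities satisfies
\[
  \frac{f(x \mid 3, 2)}{f(x \mid 2, 2)} = \exp\left(\frac{|x - 2| - |x - 3|}{2}\right) \leq e^{1/2} = e^{\varepsilon},
\]
which is the bound that $\varepsilon$-differential privacy requires for $\varepsilon = 0.5$.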
\begin{figure}[tbp]
\centering
\includegraphics[width=0.5\linewidth]{laplace}
\caption{A Laplace distribution for \emph{location} $\mu = 3$ and \emph{scale} $b = 2$.}
\label{fig:laplace}
\end{figure}
Note that in this example, the applied privacy preserving approaches are intentionally quite simplistic for demonstration purposes, without taking into account data continuity and the more advanced attacks (e.g.,~background knowledge attack, temporal inference attack, etc.) described in Section~\ref{subsec:privacy-attacks}.
Follow-up works have been developed to meet these attacks, as we review in the next section.

37
discussion.tex Normal file

@ -0,0 +1,37 @@
\subsection{Discussion}
\label{subsec:discussion}
In the previous sections, we provided a brief summary and review of each work that falls into the categories of Microdata and Statistical Data privacy-preserving publication under continual data schemes.
The main elements that have been summarized in Table~\ref{tab:related} allow us to make some interesting observations, on each category individually, and more generally.
In the Statistical Data section, all of the works deal with data linkage attacks, while there are some more recent works taking into consideration possible data correlations as well.
We notice that data linkage is currently assumed in the bibliography as the worst case scenario.
For this reason, works in the Statistical Data category provide a robust privacy protection solution independent of the adversaries' knowledge.
The prevailing distortion method in this category is probabilistic perturbation.
This is justified by the fact that nearly all of the observed methods are based on differential privacy.
The majority implements the Laplace mechanism, while some of them offer an adaptive approach.
In the Microdata category we observe that problems with sequential data, i.e.,~data that are generated in a sequence and dependent on the values in previous data sets, are more prominent.
It is important to note that works on this set of problems actually followed similar scenarios, i.e.,~publishing updated versions of an original data set, either vertically (schema-wise) or horizontally (tuple-wise).
Naturally, in such cases the most evident attack scenarios are the complementary release ones, as in each release there is a high probability of an intersection of tuples with previous releases.
On the other hand, when the problem involves stream data/processing, we observe that these data are location specific, most commonly trajectories.
In such cases, the attacks considered are wider (than only versions of an original data set), taking into account external information, e.g.,~correlations that typically may be available for location specific data.
Speaking of correlations, in either category, we may see that the protection method used is mainly probabilistic, if not total suppression.
This makes sense, since by generalization the correlation between attributes would not be canceled.
Generalization is naturally used in group-based techniques, to make it possible to group more tuples under the generated categories, and thus achieve anonymization.
As far as the protection levels are concerned, the Microdata category mainly targets event level protection, as all users are protected equally through the performed grouping.
Still, scenarios that contain trajectories associated with a certain user aim to protect this user's privacy by blurring the actual trajectories (user-level).
The $w$-event level is absent in the Microdata category; one reason may be that streaming scenarios are not prominent in this category, and another, practical reason may be that this notion was introduced later in time.
Indeed, none of the works in the Microdata category explicitly mentions the level of privacy, as these levels have been introduced in differential privacy scenarios, hence in Statistical Data.
Considering all the use cases from both categories, event-level protection is more prominent, as it is more practical to protect all the users as a single set than each one individually in continual settings.
As already discussed, problems with streaming processing are not common in the Microdata category.
Indeed, most of the cases including streaming scenarios are in the Statistical Data category.
A technical reason behind this observation is that anonymizing a raw data set as a whole may be a time-consuming process, and thus not well-suited for streaming.
The complexity actually depends on the number of attributes, if we consider the possible combinations that may be enumerated.
On the contrary, aggregation functions as used in the Statistical Data category, especially in the absence of filters, are usually low-cost.
Moreover, perturbing a numerical value (the usual result of an aggregation function) does not add a lot to the complexity of the algorithm (depending of course on the perturbation model used).
For this reason, perturbing the result of a process is more time efficient than anonymizing the data set and then running the process on the anonymized data.
Still, we may argue that an anonymized data set can be more widely used; in the case of statistical data it is only the data holder that performs the processes and releases the results.

BIN
graphics/data-flow.pdf Normal file

Binary file not shown.

BIN
graphics/data-value.pdf Normal file

Binary file not shown.

BIN
graphics/laplace.pdf Normal file

Binary file not shown.


131
graphics/table-related.tex Normal file

@ -0,0 +1,131 @@
{
\setlength\tabcolsep{2pt}
\fontsize{5.35}{7.5}\selectfont
\begin{longtabu} [c]{@{} *{9}l @{}}
\toprule
\multirow{2}{*}[-2pt]{\textbf{Name}} & \multicolumn{3}{c}{\textbf{Data}} & \multicolumn{4}{c}{\textbf{Protection}} & \multirow{2}{*}[-2pt]{\textbf{Correlations}} \\ \cmidrule(l{2pt}r{2pt}){2-4} \cmidrule(l{2pt}r{2pt}){5-8}
& \textbf{Input/Output} & \textbf{Processing} & \textbf{Publishing} & \textbf{Attack} & \textbf{Method} & \textbf{Level} & \textbf{Distortion} & \\ \midrule \endhead
\multicolumn{9}{c}{\textbf{Microdata}} \\ \midrule
\hyperlink{he2011preventing}{\emph{$e-$equivalence}}~\cite{he2011preventing} & stream & batch & sequential & complementary & $l$-diversity & event & generalization & - \\
& & & & release & & & & \\ \tabucline[hdashline]{-}
\hyperlink{li2016hybrid}{Li et al.}~\cite{li2016hybrid} & stream & batch & sequential & complementary & $l$-diversity & event & generalization, & - \\
& & & & release & & & perturbation & \\ \tabucline[hdashline]{-}
\hyperlink{zhou2009continuous}{Zhou et al.}~\cite{zhou2009continuous} & stream & streaming & continuous & same as & $k$-anonymity & event & generalization, & - \\
& & & & $k$-anonymity~\cite{sweeney2002k} & & & randomization & \\ \tabucline[hdashline]{-}
\hyperlink{gotz2012maskit}{\textbf{\emph{MaskIt}}}~\cite{gotz2012maskit} & stream & streaming & continuous & correlations & $\delta$-privacy & user & suppression & temporal \\
& & & & & & & & (Markov) \\ \tabucline[hdashline]{-}
\hyperlink{ma2017plp}{\textbf{\emph{PLP}}}~\cite{ma2017plp} & stream & streaming & continuous & correlations & $\delta$-privacy & user & suppression & spatiotem- \\
& & & & & & & (probabilistic) & poral (CRFs) \\
\midrule
\hyperlink{wang2006anonymizing}{\emph{$(X, Y)-$}} & sequential & batch & sequential & complementary & $k$-anonymity & event & generalization & - \\
\hyperlink{wang2006anonymizing}{\emph{privacy}}~\cite{wang2006anonymizing} & & & & release & & & & \\ \tabucline[hdashline]{-}
\hyperlink{Shmueli}{Shmueli and} & sequential & batch & sequential & same as & $l$-diversity & event & generalization & - \\
\hyperlink{Shmueli}{Tassa}~\cite{shmueli2015privacy} & & & & $l$-diversity~\cite{machanavajjhala2006diversity} & & & & \\ \tabucline[hdashline]{-}
\hyperlink{xiao2007m}{\emph{$m-$invariance}}~\cite{xiao2007m} & sequential & batch & sequential & complementary & $l$-diversity & event & generalization & - \\
& & & & release & & & & \\ \tabucline[hdashline]{-}
\hyperlink{chen2011differentially}{\textbf{Chen et al.}}~\cite{chen2011differentially} & sequential & batch & one-shot & linkage & differential & user & perturbation & - \\
& & & & & privacy & & (Laplace) & \\ \tabucline[hdashline]{-}
\hyperlink{jiang2013publishing}{\textbf{Jiang et al.}}~\cite{jiang2013publishing} & sequential & batch & one-shot & linkage & differential & user & perturbation & - \\
& & & & & privacy & & (Laplace) & \\ \tabucline[hdashline]{-}
\hyperlink{fung2008anonymity}{\emph{$BCF-$}} & sequential & batch & incremental & complementary & $k$-anonymity & event & generalization & - \\
\hyperlink{fung2008anonymity}{\emph{anonymity}}~\cite{fung2008anonymity} & & & & release & & & & \\ \tabucline[hdashline]{-}
\hyperlink{xiao2015protecting}{\textbf{Xiao et al.}}~\cite{xiao2015protecting} & sequential & streaming & sequential & correlations & $\delta$-location set & user & \emph{Planar Isotropic} & temporal \\
& & & & & & & \emph{Mechanism (PIM)} & (Markov) \\ \tabucline[hdashline]{-}
\hyperlink{al2018adaptive}{\textbf{Al-Dhubhani et al.}}~\cite{al2018adaptive} & sequential & streaming & sequential & correlations & geo-indistin- & user & perturbation & temporal \\
& & & & & guishability & & (Planar Laplace) & \\ \tabucline[hdashline]{-}
\hyperlink{ghinita2009preventing}{\textbf{Ghinita et al.}}~\cite{ghinita2009preventing} & sequential & streaming & sequential & linkage & spatiotemporal & user & generalization (spatio- & feature \\
& & & & & transformation & & temporal cloaking) & \\
\midrule
\hyperlink{primault2015time}{\textbf{\emph{Promesse}}}~\cite{primault2015time} & time series & batch & one-shot & spatiotemporal & temporal & user & perturbation & - \\
& & & & inference & transformation & & (temporal) & \\ \midrule
\multicolumn{9}{c}{\textbf{Statistical Data}} \\ \midrule
\hyperlink{chan2011private}{Chan et al.}~\cite{chan2011private} & stream/ & streaming & continual & linkage & differential & event & perturbation & - \\
& continual & & & & privacy & & (Laplace) & \\ \tabucline[hdashline]{-}
\hyperlink{cao2015differentially}{\textbf{\emph{l-trajectory}}}~\cite{cao2015differentially} & stream/ & streaming & continuous & linkage & $l-$trajectory & w-event & perturbation & - \\
& time series & & & & & personalized & (dynamic Laplace) & \\ \tabucline[hdashline]{-}
\hyperlink{bolot2013private}{Bolot et al.}~\cite{bolot2013private} & stream & streaming & continual & linkage & differential & event & perturbation & - \\
& & & & & privacy & & (Laplace) & \\ \tabucline[hdashline]{-}
\hyperlink{quoc2017privapprox}{\emph{PrivApprox}}~\cite{quoc2017privapprox} & stream & streaming & continual & linkage & zero & event & perturbation (ran- & - \\
& & & & & knowledge & & domized response) & \\ \tabucline[hdashline]{-}
\hyperlink{li2007hiding}{Li et al.}~\cite{li2007hiding} & stream & streaming & continuous & linkage & randomization & event & perturbation & serial \\
& & & & & & & (dynamic) & (data trends) \\ \tabucline[hdashline]{-}
\hyperlink{chen2017pegasus}{\emph{PeGaSus}}~\cite{chen2017pegasus} & stream & streaming & continuous & linkage & differential & event & perturbation & - \\
& & & & & privacy & & (Laplace) & \\ \tabucline[hdashline]{-}
\hyperlink{cao2017quantifying}{Cao et al.}~\cite{cao2017quantifying} & stream & streaming & continuous & correlations & differential & event & perturbation & temporal \\
& & & & & privacy & & (Laplace) & (Markov) \\ \tabucline[hdashline]{-}
\hyperlink{kellaris2014differentially}{Kellaris et al.}~\cite{kellaris2014differentially} & stream & streaming & continuous & linkage & differential & w-event & perturbation & - \\
& & & & & privacy & & (dynamic Laplace) & \\ \tabucline[hdashline]{-}
\hyperlink{wang2016rescuedp}{\textbf{\emph{RescueDP}}}~\cite{wang2016rescuedp} & stream & streaming & continuous & linkage & differential & w-event & perturbation & serial \\
& & & & & privacy & & (dynamic Laplace) & (Pearson's r) \\
\midrule
\hyperlink{kellaris2013practical}{Kellaris et al.}~\cite{kellaris2013practical} & sequential & batch & one-shot & linkage & differential & event & perturbation & - \\
& & & & & privacy & & (Laplace) & \\ \tabucline[hdashline]{-}
\hyperlink{chen2012differentially}{\textbf{Chen et al.}}~\cite{chen2012differentially} & sequential & batch & one-shot & linkage & differential & user & perturbation & - \\
& & & & & privacy & & (adaptive Laplace) & \\ \tabucline[hdashline]{-}
\hyperlink{hua2015differentially}{\textbf{Hua et al.}}~\cite{hua2015differentially} & sequential & batch & one-shot & linkage & differential & user & perturbation (ex- & - \\
& & & & & privacy & & ponential, Laplace) & \\ \tabucline[hdashline]{-}
\hyperlink{li2017achieving}{\textbf{Li et al.}}~\cite{li2017achieving} & sequential & batch & one-shot & linkage & differential & user & perturbation & - \\
& & & & & privacy & & (Laplace) & \\
\midrule
\hyperlink{erdogdu2015privacy}{Erdogdu et al.}~\cite{erdogdu2015privacy} & time series & batch/ & continual & correlations & $\epsilon_t$-privacy & user & perturbation & serial \\
& & streaming & & & & & (stochastic) & (HMM) \\ \tabucline[hdashline]{-}
\hyperlink{yang2015bayesian}{\emph{Bayesian differen-}} & time series & batch & one-shot & correlations & \emph{Pufferfish} & event & perturbation & general \\
\hyperlink{yang2015bayesian}{\emph{tial privacy}}~\cite{yang2015bayesian} & & & & & & & (Laplace) & (Gaussian) \\ \tabucline[hdashline]{-}
\hyperlink{song2017pufferfish}{Song et al.}~\cite{song2017pufferfish} & time series & batch & one-shot & correlations & \emph{Pufferfish} & event/user & perturbation & general \\
& & & & & & & (dynamic Laplace) & (Markov) \\ \tabucline[hdashline]{-}
\hyperlink{fan2013differentially}{\textbf{Fan et al.}}~\cite{fan2013differentially} & time series & streaming & continuous & correlations & differential & event & perturbation & spatiotem- \\
& & & & & privacy & & (Laplace) & poral/serial \\ \tabucline[hdashline]{-}
\hyperlink{wang2017cts}{\emph{CTS-DP}}~\cite{wang2017cts} & time series & streaming & continuous & correlations & differential & event & perturbation & serial (autocor- \\
& & & & & privacy & & \emph{(correlated Laplace)} & relation function) \\ \tabucline[hdashline]{-}
\hyperlink{fan2014adaptive}{\emph{FAST}}~\cite{fan2014adaptive} & time series & streaming & continuous & linkage & differential & user & perturbation & - \\
& & & & & privacy & & (dynamic Laplace) & \\
\bottomrule
\caption{Summary table of reviewed privacy methods. Location-specific techniques are listed in bold; the rest are not data-type specific.}
\label{tab:related}
\end{longtabu}
}

View File

@ -1,4 +1,74 @@
\chapter{Introduction}
\label{ch:intro}
This is the introduction.
\section{Test}
\section{Introduction}
\label{sec:introduction}
Data privacy is becoming an increasingly important issue both at a technical and a societal level and introduces various challenges ranging from the way we share and publish data sets to the way we use online and mobile services.
Personal information, also described as \emph{microdata}, has acquired increasing value and is in many cases used as the `currency'~\cite{economist2016data} to pay for access to various services, i.e.,~users are asked to exchange their personal information for the service provided.
This is particularly true for many \emph{Location Based Services (LBS)} like Google Maps~\cite{gmaps}, Waze~\cite{waze}, etc.; these services offer their `free' service in exchange for collecting and using user-generated data, such as timestamped geolocated information.
Besides navigation services, social media applications (e.g.,~Facebook~\cite{facebook}, Twitter~\cite{twitter}, Foursquare~\cite{foursquare}, etc.) take advantage of user-generated and user-related data to make relevant recommendations and show personalized advertisements.
Here, too, location is among the most important private data that users are required to share.
Last but not least, \emph{data brokers} (e.g.,~Experian~\cite{experian}, TransUnion~\cite{transunion}, Acxiom~\cite{acxiom}, etc.) collect data from public and private resources, e.g.,~censuses, bank card transaction records, voter registration lists, etc.
Most of these data are georeferenced and contain directly or indirectly location information; protecting the location of the user has become one of the most important goals so far.
These data, on the one hand, give useful feedback to the involved users and/or services, and on the other hand, provide valuable information to internal/external analysts.
While these activities happen within the boundaries of the law~\cite{tankard2016gdpr}, it is important to be able to anonymize the corresponding data before sharing and to take into account the possibility of correlating, linking and crossing diverse independent data sets.
Especially the latter is becoming quite important in the era of Big Data, where the existence of diverse linked data sets is one of the promises (for example, one can refer to the discussion on Linked Open Data~\cite{efthymiou2015big}).
The ability to extract additional information impacts the ways we protect our data and affects the privacy guarantees we can provide.
Besides the explosion of online and mobile services, another important aspect is that a lot of these services actually rely on data provided by the users (\textit{crowdsourced} data) to function, prominent examples being efforts like Wikipedia~\cite{wiki} and OpenStreetMap~\cite{osm}.
Data from crowdsourcing-based applications, if not protected correctly, can easily be used to identify personal information like location or activity and thus lead indirectly to issues of user surveillance~\cite{lyon2014surveillance}.
Nonetheless, users seem reluctant to undertake the financial burden to support the numerous services that they use~\cite{savage2013value}.
While this does not mean that the various aggregators of personal/private data are exonerated of any responsibility, it imposes the need to work within this model, providing the necessary technical solutions to increase data privacy.
\begin{figure}[tbp]
\centering
\includegraphics[width=\linewidth]{data-flow}
\caption{The usual flow of crowdsourced data harvested by publishers, anonymized, and released to data consumers.}
\label{fig:data-flow}
\end{figure}
However, providing adequate user privacy affects the utility of the data, which is associated with one of the five dimensions (known as the five \emph{`V's'}) that define Big Data: its \textit{Veracity}.
Through privacy preserving processes, in the case of \textit{microdata}, a private version, containing some synthetic data as well, is generated, where the users are not distinguishable.
In the case of \textit{statistical} data (e.g.,~the results of statistical queries over our data sets), a private version is generated by adding some kind of noise on the actual statistical values.
In both cases, we end up affecting the quality of the published data set, and the privacy and the utility of the `noisy' private output are two contrasting desiderata, which need to be measured and balanced.
As a matter of fact, the added noise is greater when we consider external threats (e.g.,~linked or correlated data), in order to ensure the same level of protection, inevitably affecting the utility of the data set.
For this reason, the abundance of external information in the Big Data era needs to be taken into account in the traditional processing flow, shown in Figure~\ref{fig:data-flow}.
While we still need to go through the preprocessing step to make the data private before releasing it for public use, we should make sure that the quality/privacy ratio is re-stabilized.
This discussion introduces the importance of being able to correctly choose the proper privacy algorithms that would allow users to provide private copies of their data with some guarantees.
Finding a balance between privacy and data utility is a task far from trivial for any privacy expert.
On the one hand, it is crucial to select an appropriate anonymization technique, relevant to the data set intended for public release.
On the other hand, it is equally essential to tune the selected technique according to the circumstances, e.g.,~assumptions, level of distortion, etc.~\cite{kifer2011no}.
Selecting the wrong privacy algorithm or configuring it poorly may not only put at risk the privacy of the involved individuals, but also end up deteriorating the quality, and therefore the utility, of the data set.
\begin{figure}[tbp]
\centering
\includegraphics[width=0.5\linewidth]{data-value}
\caption{Value of data to decision making over time from less than seconds to more than months~\cite{gualtieri2016perishable}.}
\label{fig:data-value}
\end{figure}
In this context, in this thesis we focus on privacy in continual publication scenarios, with an emphasis on works taking into account data correlations, since this field (i) includes the most prominent cases, such as location privacy problems, and (ii) constitutes the most challenging and not yet well-charted part of the privacy algorithms, since it is rather new and increasingly complex.
The types of data in these cases require timely processing, since their value usually decreases over time, as demonstrated in Figure~\ref{fig:data-value}.
This allows us to provide an insight into additional properties of the algorithms, like for instance if they work on streaming or real-time data, or if they take into account existing data correlations either within the data set or with external data sets.
Geospatial data commonly fall in this category; a few examples include, but are not limited to, data being produced while tracking the movement of individuals for various purposes (where data should become private on the move and in real-time), crowdsourced data that are used to report measurements like noise or pollution (where data should become private before reaching the server), and even data items like photographs or social media posts that might include location information (where data should become private before the posts become public).
In most of these cases, the privacy preserving processes should take into account implicit correlations that exist, since data have a spatial dimension and space imposes its own restrictions.
The domain of data privacy is rather vast, and naturally several works have already been conducted on different scopes.
Subsequently, we refer the interested reader to a non-exhaustive list of relevant articles.
A group of works focuses on the family of algorithms used to make the data private.
For instance, Simi et al.~\cite{simi2017extensive} provide an extensive study of works on $k$-anonymity and Dwork~\cite{dwork2008differential} focuses on differential privacy.
Another group of works focuses on techniques that allow the execution of data mining or machine learning tasks with some privacy guarantees, e.g.,~Wang et al.~\cite{wang2009survey}, and Ji et al.~\cite{ji2014differential}.
In a more general scope, Wang et al.~\cite{wang2010privacy} offer a summary and evaluation of privacy-preserving data publishing techniques.
Additional works look into issues around Big Data and user privacy.
Indicatively, Jain et al.~\cite{jain2016big}, and Soria-Comas et al.~\cite{soria2016big} examine how Big Data conflict with preexisting concepts of private data management, and how efficiently $k$-anonymity and $\varepsilon$-differential privacy meet Big Data requirements.
Finally, there are some works that focus on the application.
For example, Zhou et al.~\cite{zhou2008brief} focus on social networks, Christin et al.~\cite{christin2011survey} give an outline of how privacy aspects are addressed in crowdsensing applications, and Primault et al.~\cite{primault2018long} summarize privacy threats and location privacy-preserving mechanisms.
% This thesis is organized as follows: we begin by providing a general description of the field of data privacy, and the most prominent anonymization and obfuscation/noise inducing algorithms that have been proposed in the literature so far (Section~\ref{sec:background}).
% The main content of the thesis (Section~\ref{sec:main}) spans works related to the continual publication of data points, or to the republication of (or parts of) a data set along time, with regard to the privacy of the individuals involved.
% More particularly, we divide the works in two categories, based on the type of data published: microdata or statistical data.
% In all cases, we use the same set of properties to characterize the algorithms, and thus, allow to compare them.
% Finally (Section~\ref{sec:conclusion}), we put these works into perspective and discuss various future research lines in this area.

View File

@ -4,29 +4,65 @@
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage[hyphenbreaks]{breakurl}
\usepackage{booktabs}
\usepackage{stmaryrd}
\usepackage[T1]{fontenc}
\usepackage{cite}
\graphicspath{ {images/} }
% packages for algorithm formatting
\usepackage{algorithm}
%\usepackage{algorithmic}
\usepackage{algpseudocode}
\usepackage[table]{xcolor}
\usepackage{amssymb,amsmath}
%custom packages
\usepackage{adjustbox}
\usepackage[normalem]{ulem}
\usepackage{geometry}
\usepackage{xcolor}
\usepackage{enumerate}
\usepackage{multirow}
\usepackage{adjustbox}
\usepackage{booktabs}
\usepackage{longtable}
\usepackage{tabu}
\newtabulinestyle{hdashline=0.5pt on 0.5pt off 2pt}
\usepackage{float}
\usepackage{caption}
\definecolor{smaragdine}{RGB}{0,159,119} % <3
\graphicspath{{graphics/}}
\DeclareGraphicsExtensions{.pdf, .png}
\begin{document}
\input{titlepage}
% \input{titlepage}
\frontmatter
\tableofcontents
\listoffigures
\listoftables
% \tableofcontents
% \listoffigures
% \listoftables
\input{acknowledgements}
% \input{acknowledgements}
\input{abstract}
% \input{abstract}
\mainmatter
\input{introduction}
% \input{introduction}
\input{background}
% \input{related}
\backmatter
\bibliographystyle{alpha}
\bibliography{bibliography}
\end{document}

90
microdata.tex Normal file
View File

@ -0,0 +1,90 @@
\section{Microdata}
\label{sec:microdata}
As observed in Table~\ref{tab:related}, privacy-preserving algorithms for microdata rely on $k$-anonymity, or derivatives of it. Ganta et al.~\cite{ganta2008composition} revealed that $k$-anonymity methods are vulnerable to \emph{composition attacks}. Consequently, these attacks drew the attention of researchers, who proposed various algorithms based on $k$-anonymity, each introducing a different dimension to the problem, for instance that previous releases are known to the publisher, or that the quasi-identifiers can be formed by combining attributes in different releases. Note, however, that only one of the following works (Li et al.~\cite{li2016hybrid}) assumes \emph{independently} anonymized data sets that may not be known to the publisher in the attack model, making it more general than the rest of the works.
% \subsection{Continual data}
% \mk{Nothing to put here.}
\subsection{Data streams}
% M-invariance: towards privacy preserving re-publication of dynamic data sets
\hypertarget{xiao2007m}{Xiao et al.}~\cite{xiao2007m} consider the case where a data set is (re)published at different points in time, with updates (tuple deletions and insertions) applied between releases. More precisely, they address anonymization in dynamic environments by implementing \emph{$m$-invariance}. In a simple $k$-anonymization (or $l$-diversity) scenario, the privacy of an individual who exists in two updates can be compromised by intersecting the corresponding sets of sensitive values. In contrast, an individual who exists in a series of $m$-invariant releases is always associated with the same set of $m$ different sensitive values. To enable the publishing of $m$-invariant data sets, artificial tuples called \emph{counterfeits} may be added to a release. To minimize the noise added to the data sets, the authors provide an algorithm with two extra desiderata: minimizing the number of counterfeits and the generalization level of the quasi-identifiers. Still, the choice of adding tuples with specific sensitive values disturbs the value distribution, with a direct effect on any relevant statistical analysis.
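As a hypothetical illustration of the intersection problem (the values below are ours and are not taken from~\cite{xiao2007m}), suppose that an individual belongs to a group with sensitive values $\{a, b\}$ in a first release and to a group with sensitive values $\{a, c\}$ in a second one. An adversary who knows that the individual appears in both releases can intersect the two sets, whereas under $m$-invariance with $m = 2$ both groups would have to carry the same set, e.g.,~$\{a, b\}$:
\[
  \{a, b\} \cap \{a, c\} = \{a\}
  \qquad \text{versus} \qquad
  \{a, b\} \cap \{a, b\} = \{a, b\}.
\]
In the first case the sensitive value is fully disclosed, while in the second the adversary still faces $m$ candidate values.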
% Preventing equivalence attacks in updated, anonymized data
In the same update setting (insert/delete), \hypertarget{he2011preventing}{He et al.}~\cite{he2011preventing} introduce another kind of attack, namely the \emph{equivalence} attack, which is not taken into account by the aforementioned $m$-invariance technique. The equivalence attack allows for sets of individuals (of size $e<m$) to be associated with sets of sensitive values with a probability lower than $m$, in different snapshots. For example, through tuple deletions, we may infer that two individuals share the exact same sensitive value (and thus may be considered equivalent). In order for a series of snapshot releases to be private, they have to be both $m$-invariant and $e$-equivalent ($e\leq m$). Subsequently, the authors propose an algorithm that incorporates $m$-invariance and is based on the graph optimization \emph{min cut} problem, for publishing $e$-equivalent data sets. The proposed method can achieve better levels of privacy, with running times and quality comparable to those of $m$-invariance.
% A hybrid approach to prevent composition attacks for independent data releases
\hypertarget{li2016hybrid}{Li et al.}~\cite{li2016hybrid} identified a common assumption in most of the privacy techniques: when anonymizing a data set, all previous releases are known to the data owner. It is probable, however, that the releases are independent from each other, and that the data owner is unaware of these releases when anonymizing the data set. In such a setting, the previous techniques would suffer from composition attacks. The authors define this kind of adversary and propose a hybrid model for data anonymization. More precisely, the adversary knows that an individual exists in two different data sets and has obtained their anonymized versions, but the anonymization is done independently (i.e.,~without knowledge of the other data set) for each data set. The key idea in fighting a composition attack is to increase the probability that the matches among tuples from the two data sets are random, linking different individuals rather than the same one. To do so, the proposed anonymization exploits three preprocessing steps before applying a traditional $k$-anonymity or $l$-diversity anonymization algorithm. First, the data set is sampled so as to blur the knowledge of the existence of individuals. Then, especially in small data sets, quasi-identifiers are perturbed by noise addition, before the classical generalization step. Finally, in the case of sparse data, the sensitive values are generalized in addition to the quasi-identifiers. The danger of composition attacks is less prominent when this method is used on top of $k$-anonymity than without it, while the quality of the results remains comparable. Moreover, the quality results are shown to be substantially better than those obtained by the use of $\varepsilon$-differential privacy. This is a good attempt at independently anonymizing a data release multiple times; however, the scenario is restricted to releases over the same database schema, using the same perturbation and generalization functions.
% Continuous privacy preserving publishing of data streams
\hypertarget{zhou2009continuous}{Zhou et al.}~\cite{zhou2009continuous} introduce the problem of continuous private data publication in \emph{streams}, and propose a randomized solution based on $k$-anonymity. In their definition, a private stream consists of published equivalence classes of size larger than or equal to $k$, containing generalized tuples from distinct persons (or identifiers in general). To create the equivalence classes they set several desiderata. Apart from the size of a class, which should be larger than or equal to $k$, the information loss incurred by the generalization should be low, and the delay in forming and publishing the class should be low as well. To achieve these, they build a randomized model using the popular structure of $R$-trees, extended to accommodate data density distribution information. In this way, they achieve a better quality for the released private data: on the one hand, formed classes contain data items that are close to each other (in dense areas), while on the other hand, classes with tuples from sparse areas are released as soon as possible so that the delay remains low. This work has a special focus on publishing good-quality private data. Still, it does not consider attacks where background knowledge exists, nor does it measure the privacy level achieved (other than requiring the size of the released class to be larger than or equal to $k$, as in $k$-anonymity), as $\varepsilon$-differential privacy does.
% Maskit: Privately releasing user context streams for personalized mobile applications
\hypertarget{gotz2012maskit}{Gotz et al.}~\cite{gotz2012maskit} developed \emph{MaskIt}, a system that interfaces with the sensors of a personal device, identifies various sets of \emph{contexts}, and releases a stream of privacy-preserving contexts to untrusted applications installed on the device. A context is defined as the circumstances that form the setting for an event, e.g.,~`at the office', `running', etc. The users have to define the sensitive contexts that they wish to be protected and the desired level of privacy. The system models the users' various contexts and the transitions between them. Temporal correlations are captured using Markov chains by taking into account historical observations. After the initialization, \emph{MaskIt} filters a stream of user contexts by checking for each context whether it can be released or needs to be suppressed. More specifically, a system $A$ preserves \emph{$\delta$-privacy} against an adversary if for all possible inputs $\overrightarrow{x}$ sampled from the Markov chain $M$ with non-zero probability (i.e.,~$\Pr[\overrightarrow{x}] > 0$), for all possible outputs $\overrightarrow{o}$ ($\Pr[A(\overrightarrow{x}) = \overrightarrow{o}] > 0$), for all times $t$ and all sensitive contexts $s\in S$, it satisfies the condition $\Pr[X_t = s|\overrightarrow{o}] - \Pr[X_t = s] \leq \delta$. After filtering all the elements of a given stream, an output sequence for a single day is released. The process can be repeated to publish longer context streams. The utility of the system is measured as the expected number of released contexts. Letting users define the privacy settings requires that they have a certain level of relevant knowledge, which is not usually the case in real life. Additionally, suppressing data can sometimes disclose more information than releasing them, e.g.,~releasing multiple data points around a `sensitive' area (and not inside it) is eventually going to disclose the protected area.
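To make the $\delta$-privacy condition concrete, consider a hypothetical numerical instance (the figures are illustrative and not taken from~\cite{gotz2012maskit}): let the prior belief of the adversary be $\Pr[X_t = s] = 0.10$ and let the user choose $\delta = 0.05$. The condition then bounds the posterior belief as
\[
  \Pr[X_t = s \mid \overrightarrow{o}] \leq \Pr[X_t = s] + \delta = 0.10 + 0.05 = 0.15.
\]
An output sequence that would raise the posterior belief to, say, $0.30$ violates the condition, since $0.30 - 0.10 = 0.20 > \delta$.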
% PLP: Protecting location privacy against correlation analyze Attack in crowdsensing
\hypertarget{ma2017plp}{Ma et al.}~\cite{ma2017plp} propose \emph{PLP}, a crowdsensing scheme that protects location privacy against adversaries that can extract spatiotemporal correlations (modeled with CRFs) from crowdsensing data. The stream of a user's contexts (location, sensing data) is filtered while long-range dependencies among locations and reported sensing data are taken into account. Sensing data are suppressed at all sensitive locations, while data at insensitive locations are reported with a certain probability defined by observing the corresponding CRF model. On the one hand, the privacy of the reported data is estimated by the difference $\delta$ between the probability that a user would be at a specific location given supplementary information and the same probability without the extra information. On the other hand, the utility of the method depends on the total amount of reported data (more is better). An estimation algorithm searches for the optimal strategy that maximizes utility while preserving a predefined privacy threshold. Although this approach allows users to define their desired privacy prerequisites, it cannot guarantee optimal protection.
\subsection{Sequential data}
% Anonymizing sequential releases
\hypertarget{wang2006anonymizing}{Wang and Fung}~\cite{wang2006anonymizing} address the problem of anonymously releasing different projections of the same data set at subsequent timestamps. More precisely, the authors want to protect individual information that could be revealed by \emph{joining} various releases of the same data set. To do so, instead of locating the quasi-identifiers in a single release, the authors suggest that the identifiers may span the current and all previous releases of the (projections of the) data set. Then, the proposed method uses the join of the different releases on the common identifying attributes. The goal is to generalize the identifying attributes of the current release, given that previous releases are immutable. The generalization is performed in a top-down manner, meaning that the attributes are initially over-generalized and are then specialized step by step until predefined quality and privacy requirements are met. The privacy requirements are the so-called \emph{$(X,Y)$-privacy} for a threshold $k$, meaning that the identifying attributes in $X$ are linked with at most $k$ sensitive values in $Y$, in the join of the previously released and current tables. The quality requirements can be tuned into the framework, and three alternatives are proposed: the reduction of the class entropy~\cite{quinlan2014c4,shannon2001mathematical}, the notion of distortion, and the discernibility~\cite{bayardo2005data}. The authors propose an algorithm for the release of a table $T_1$ in the existence of a previous table $T_2$, which takes into account the scalability and performance problems that a join between those two may entail. Still, when many previous releases exist, the complexity would remain high.
% Privacy by diversity in sequential releases of databases
\hypertarget{Shmueli}{Shmueli and Tassa}~\cite{shmueli2015privacy} identified the computational inefficiency of anonymously releasing a data set while taking into account previous releases in scenarios of sequential publication. In more detail, they consider the case where projections over different subsets of attributes of a table are published at subsequent times, and they provide an extension for attribute addition. Their algorithm can compute $l$-diverse anonymized releases (over different subsets of attributes) in parallel, by generating $l-1$ so-called \emph{fake} worlds. A fake world is generated from the base table by randomly permuting non-identifier and sensitive values among the tuples, in such a way that minimal information loss (quality desideratum) is incurred. This is achieved, in part, by verifying that the permutation is done among quasi-identifiers that are similar. Then, the algorithm creates buckets of tuples with at least $l$ different sensitive values, in which the quasi-identifiers are generalized in order to achieve $l$-diversity (privacy protection desideratum). The generalization step is also conducted in a way that limits information loss. All different releases are $l$-diverse, because they are created assuming the same possible worlds, with which they are consistent. Tuple/attribute deletion is briefly discussed and left as an open question. The paper is contrasted with a previous work~\cite{shmueli2012limiting} by the same authors, claiming that the new approach considers a stronger adversary (the adversary knows all individuals with their quasi-identifiers in the database, and not only one), and that the computation is much more efficient, as its complexity is not exponential with respect to the number of previous publications.
% Differentially private trajectory data publication
\hypertarget{chen2011differentially}{Chen et al.}~\cite{chen2011differentially} propose a non-interactive data-dependent sanitization algorithm to generate a differentially private release for trajectory data. First, a noisy \emph{prefix tree}, i.e.,~an ordered search tree data structure used to store an associative array, is constructed. Each node represents a possible location of a trajectory (a legitimate location from the set of locations where any user can be present) and contains a perturbed count (the number of persons at that location) with noise drawn from a Laplace distribution. The privacy budget is equally allocated to each level of the tree. At each level, and for every node, children nodes with a non-zero number of trajectories are identified as \emph{non-empty} by observing their noisy counts, so as to continue expanding them. All children nodes are associated with disjoint subsets and thus, the parallel composition theorem of differential privacy can be applied. Therefore, the whole budget of a level can be used for each of its nodes. An empty node is detected by injecting Laplace noise to its corresponding count and checking if it is less than a preset threshold $\theta=\frac{2\sqrt{2}}{\varepsilon / h}$, where $\varepsilon$ is the available privacy budget and $h$ is the height of the tree. To generate the sanitized database, it is necessary to traverse the prefix tree once in post-order. At each node, the number of terminated trajectories is calculated and corresponding copies of prefixes are sent to the output. During this process, some consistency constraints are taken into account to avoid erroneous trajectories due to the noise added previously. Namely, for any root-to-leaf path $p$ and every $v_i \in p$, $|tr(v_i)| \leq |tr(v_{i+1})|$, where $v_i$ is a child of $v_{i+1}$, and for each node $v$, $|tr(v)| \geq \sum_{u \in children(v)} |tr(u)|$. Increasing the privacy budget results in a lower average relative error because less noise is added at each level.
By increasing the height of the tree, the relative error initially decreases as more information is retained from the database. However, after a certain threshold, the increase of height can result in less available privacy budget at each level and thus more relative error due to the increased perturbation.
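To give a sense of scale for the threshold $\theta$ (a worked example under assumed values, not taken from~\cite{chen2011differentially}), suppose $\varepsilon = 1$ and $h = 5$, so that each level receives $\varepsilon / h = 0.2$:
\[
  \theta = \frac{2\sqrt{2}}{\varepsilon / h} = \frac{2\sqrt{2}}{0.2} \approx 14.14 .
\]
Since a Laplace distribution with scale $b$ has standard deviation $\sqrt{2}\,b$, and assuming noise of scale $b = h / \varepsilon$ at each level, $\theta$ can be read as two standard deviations of the injected noise. Doubling the height to $h = 10$ doubles the threshold to roughly $28.28$, which is consistent with the growth of the relative error for tall trees.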
% Publishing trajectories with differential privacy guarantees
\hypertarget{jiang2013publishing}{Jiang et al.}~\cite{jiang2013publishing} focus on ship trajectories with known starting and terminal points. More specifically, they study several different noise addition mechanisms for publishing trajectories with differential privacy guarantees. These mechanisms include adding \emph{global} noise to the trajectory, adding noise to each location \emph{point} of the trajectory by sampling a noisy radius from an exponential distribution, and adding noise drawn from a Laplace distribution to each \emph{coordinate} of every location point. In their comparison, the latter offers a better privacy guarantee and a smaller error bound, but the resulting trajectory is noticeably distorted, raising doubts about its practicality. A \emph{Sampling Distance and Direction (SDD)} mechanism is proposed to tackle the limited practicality that results from adding Laplace noise to the trajectory coordinates. It enables the publishing of the optimal next possible trajectory point by sampling a suitable distance and direction at the current position, taking into account the ship's maximum speed constraint. The SDD mechanism outperforms the other mechanisms and can maintain good utility with very high probability, even while offering strong privacy guarantees.
% Anonymity for continuous data publishing
\hypertarget{fung2008anonymity}{Fung et al.}~\cite{fung2008anonymity} introduce the problem of privately releasing continuous \emph{incremental} data sets. The invariant of this kind of release is that at every timestamp $T_i$, the records previously released at a timestamp $T_j$, where $j<i$, are released again together with a set of new records. The authors first focus on two consecutive releases and describe three classes of possible attacks. They name these attacks \emph{correspondence} attacks because they rely on the principle that every tuple from data set $D_1$ corresponds to a tuple in the subsequent data set $D_2$. Naturally, the opposite does not hold, as tuples with a timestamp $T_2$ do not exist in $D_1$. Assuming that the attacker knows the quasi-identifiers and the timestamp of the record of a person, they define the \emph{backward}, \emph{cross}, and \emph{forward} (\emph{BCF}) attacks. They show that combining two individually $k$-anonymized subsequent releases using one of the aforementioned attacks can lead to `cracking' some of the records in the set of $k$ candidate tuples, rendering the privacy level lower than $k$. Apart from detecting cases where BCF anonymity is compromised between two releases, the authors also provide an anonymization algorithm for a release $R_2$ in the presence of a private release $R_1$. The algorithm starts from the most generalized possible state for the quasi-identifiers of the records in $D_2$. Step by step, it checks which combinations of specializations on the attributes do not violate the BCF anonymity and outputs the most specialized possible version of the data set. The authors discuss how the framework extends to multiple releases and to different kinds of privacy methods (other than $k$-anonymization).
It is worth noting that in order to maintain a certain quality for a release, it is essential that the delta among subsequent releases is large enough; otherwise the needed generalization level may destroy the utility of the data set.
% Protecting Locations with Differential Privacy under Temporal Correlations
\hypertarget{xiao2015protecting}{Xiao et al.}~\cite{xiao2015protecting} propose another privacy definition based on differential privacy that accounts for temporal correlations in geo-tagged data. Location changes between two consecutive timestamps are determined by temporal correlations modeled through a Markov chain. A \emph{$\delta$-location} set includes all the probable locations where a user might appear, excluding locations of low probability. Therefore, the true location is hidden in the resulting set, in which any pair of locations is indistinguishable, and thus the user is protected. The lower the value of $\delta$, the more locations are included and hence, the higher the level of privacy that is achieved. The \emph{Planar Isotropic Mechanism (PIM)} is used as a perturbation mechanism to add noise to the released locations. It is proved that the $l_1$-norm sensitivity fails to capture the exact sensitivity, i.e.,~the difference between any two query answers from two instances in neighboring databases, in a multidimensional space. For this reason, the \emph{sensitivity hull}, a notion independent of the context of location privacy, is utilized instead. In~\cite{xiao2017loclok}, they demonstrate the functionality of their system \emph{LocLok}, which implements the concept of $\delta$-location. In spite of taking into account temporal correlations for identifying the next possible locations of a user, the proposed definition does not evaluate the corresponding privacy leakage.
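A possible formalization of the $\delta$-location set (a sketch in our own notation; $p_t^{-}[s]$ denotes an assumed prior probability, derived from the Markov model, of the user being at location $s$ at time $t$) is the smallest set of locations whose prior probabilities sum up to at least $1 - \delta$:
\[
  \Delta X_t = \arg\min_{S} |S| \quad \mathrm{s.t.} \quad \sum_{s \in S} p_t^{-}[s] \geq 1 - \delta .
\]
For example, with prior probabilities $0.5$, $0.3$, $0.15$, and $0.05$ over four locations, $\delta = 0.2$ yields the set of the two most probable locations, whereas $\delta = 0.05$ yields the three most probable ones.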
% An adaptive geo-indistinguishability mechanism for continuous LBS queries
\hypertarget{al2018adaptive}{Al-Dhubhani et al.}~\cite{al2018adaptive} propose an adaptive privacy-preserving technique that deals with correlation analysis attacks by adjusting the amount of noise required to obfuscate a user's location based on its correlation level with the previously released (obfuscated) locations. Their technique is based on \emph{geo-indistinguishability}~\cite{andres2013geo}, an adaptation of differential privacy for location data, which adds to users' locations controlled random noise drawn from a bivariate Laplace distribution (\emph{Planar Laplace}). The system architecture considered involves only the users and the queried service providers, excluding any third-party entities. The adversary's ability to estimate a user's position is first evaluated by utilizing a regression algorithm that exploits previous location releases over a certain prediction window, and noise is then added accordingly. That is, in areas where locations present strong correlations, and hence an adversary can predict the current location with lower estimation error, more noise is added to the released locations. The opposite holds for locations with weaker correlations. Adapting the amount of injected noise depending on the data correlation level might lead to better performance, in terms of both privacy and utility, in the short term. However, altering the amount of injected noise at each timestamp without taking into account the previously released data can lead to arbitrary privacy and utility loss in the long term. Applying a filtering algorithm on the perturbed data points, prior to their release, can effectively deal with any possible data discrepancy.
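For reference, geo-indistinguishability and the Planar Laplace mechanism of~\cite{andres2013geo} can be summarized as follows (a sketch; the symbols $K$, $d$, and $x_0$ are notation introduced here). A mechanism $K$ satisfies $\varepsilon$-geo-indistinguishability if, for all locations $x$, $x'$ and every set $Z$ of reported locations,
\[
  K(x)(Z) \leq e^{\varepsilon\, d(x, x')} \, K(x')(Z),
\]
where $d$ is the Euclidean distance. The Planar Laplace mechanism reports a point $x$ around the true location $x_0$ drawn from the density
\[
  D_{\varepsilon}(x_0)(x) = \frac{\varepsilon^2}{2\pi} \, e^{-\varepsilon\, d(x_0, x)} .
\]
An adaptive scheme in the spirit of the one described above would then use a smaller $\varepsilon$ (more noise) where the estimated correlation is strong, and a larger $\varepsilon$ where it is weak.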
% Preventing velocity-based linkage attacks in location-aware applications
\hypertarget{ghinita2009preventing}{Ghinita et al.}~\cite{ghinita2009preventing} tackle attacks on location privacy that arise from the linkage of maximum user velocity with cloaked regions, due to adversarial background knowledge, when using Location-Based Services. The proposed methods prevent the disclosure of the exact user location coordinates and bound the association probability to a certain user-defined threshold related to user-sensitive features, e.g.,~religious beliefs, health condition, etc., linked to corresponding locations, e.g.,~church, hospital, etc. The first method, referred to as \emph{temporal cloaking}, is achieved via either \emph{deferral} or \emph{postdating}. The former is applied by delaying the disclosure of a cloaked region that is `too far' from the previously reported region, i.e.,~impossible to have been reached based on the known maximum user speed. The latter reports the nearest previous cloaked region instead; since it is close to the actual region, the corresponding results are highly likely to be relevant. A request is usually postdated when the user-specified threshold is exceeded; otherwise, the nearest candidate region is selected and is deferred or postdated depending on the outcome of the comparison. The second method, \emph{spatial cloaking}, creates cloaked regions by first taking into account all the user-specified features relevant to the specific location (\emph{filtering of features}) and then enlarging the area of the region to satisfy the privacy requirements (\emph{cloaking}). Finally, the region is deferred until it includes the current timestamp (\emph{safety enforcement}), similarly to temporal cloaking. The final QoS, due to the privacy protection offered by the present methods, is measured in terms of the \emph{cloaked region size}, \emph{time and space error}, and \emph{failure ratio}.
The cloaked region size is taken into consideration since larger regions may decrease the usability of the retrieved information. Time and space errors are possible due to delayed location reporting and to cloaked regions, built around past locations, that do not include the current one. Finally, the failure ratio is calculated by measuring the requests dropped in cases where the specified privacy requirements cannot be satisfied. Considering the cloak granularity as the only privacy metric proves inadequate, since it can be easily compromised in cases of low user presence around the sensitive area.
\subsection{Time series}
% Time distortion anonymization for the publication of mobility data with high utility
\hypertarget{primault2015time}{Primault et al.}~\cite{primault2015time} proposed \emph{Promesse}, an algorithm that builds on time distortion instead of location distortion to ensure \emph{user-level} privacy when releasing trajectories. \emph{Promesse} takes as input a user's mobility trace, comprising a data set of pairs of geolocations and timestamps, and a parameter \emph{$\varepsilon$}, i.e.,~the privacy budget. Initially, regularly spaced locations are extracted and each one of them is interpolated at a distance depending on the previous location and the value of $\varepsilon$. Then, the first and last locations of the mobility trace are removed and uniformly distributed timestamps are assigned to the remaining locations of the trajectory. In this way, the resulting trace has a smooth speed and therefore \emph{points of interest (POIs)}, i.e.,~places where the user stayed longer, e.g.,~home, work, etc., are indistinguishable to adversaries. The algorithm works better with fine-grained data sets, because in this way it can achieve optimal geolocation and timestamp pairing. Furthermore, it can only be used offline, rendering it unsuitable for most real-life application scenarios.

30
related.tex Normal file

@ -0,0 +1,30 @@
\chapter{Related work}
\label{ch:related}
Depending on the intended use, owners may share their data either as a whole (Microdata -- Section~\ref{sec:microdata}), or as computed statistics thereof (Statistical Data -- Section~\ref{sec:statistical}).
This is the basic division line that we set in this section.
Table~\ref{tab:related} summarizes all the works reviewed in this thesis, and provides a guide for the interested reader on some of the listed variables.
We proceed by identifying some common variables in the works of both categories, which are listed in Table~\ref{tab:related}.
There are three main columns, concerning the:
\begin{itemize}
\item \textbf{Data} --- The first part of Table~\ref{tab:related} considers the nature of the input/output data of the algorithms in its first column.
We identify here the considered kind of data from the set of categories defined in Section~\ref{subsec:data-categories}, i.e.,~stream, sequential, or time series.
Particularly, we outline the cases where spatial data are explicitly considered; nevertheless, all other algorithms could be equally applied on location data as well.
The second column of the Data part is about the model (batch or stream) of the privacy preserving process and the aggregation process, where applicable.
        The third column displays the publishing scheme, which is one of the schemes defined in Section~\ref{subsec:data-publishing}, i.e.,~one-shot, continuous, sequential, or incremental.
\item \textbf{Protection} --- The second part of Table~\ref{tab:related} contains four columns, namely the attack scenario, the base protection method, the acquired protection level, and the distortion applied.
The different attack scenarios are described in Section~\ref{subsec:privacy-attacks}, whereas the base protection methods along with the applied distortion method are in Section~\ref{subsec:privacy-seminal}.
The possible protection levels are: event, user, and w-event (see Section~\ref{subsec:privacy-levels}).
\item \textbf{Correlations} --- The final part of Table~\ref{tab:related} is dedicated to correlations, which is an attack model not taken into consideration by all the works.
Still, the continuous publication of data inevitably creates correlations even when they are not evident in a standalone data set.
The different kinds of correlations can be found in Section~\ref{subsec:privacy-attacks}.
\end{itemize}
\input{graphics/table-related}
\input{microdata}
\input{statistical}
\input{discussion}

113
statistical.tex Normal file

@ -0,0 +1,113 @@
\section{Statistical data}
\label{sec:statistical}
When continuously publishing statistical data, usually in the form of counts, the most widely used privacy method is differential privacy, or derivatives of it, as witnessed in Table~\ref{tab:related}. We now review the works in this category.
% \subsection{Continual data}
% \mk{Nothing to put here.}
\subsection{Data streams}
% Private and continual release of statistics
\hypertarget{chan2011private}{Chan et al.}~\cite{chan2011private} designed a continual counting mechanism satisfying $\varepsilon$-differential privacy with poly-log error. A binary tree is constructed, where each node contains a sum of the counts in its subtree, including noise. It can be used for continual top-k queries in recommendation systems and multidimensional range queries. The mechanism provides guarantees for indefinite runtime without a priori knowledge of an upper temporal bound. It can preserve differential privacy (\emph{pan privacy}) under single or multiple unannounced \emph{intrusions}, i.e.,~snapshots of the mechanism's internal states, by adding a certain amount of noise to each active counter in memory, without incurring any loss in the asymptotic guarantees. The output of the mechanism at every timestamp is a \emph{consistent} approximate integer count, i.e.,~at each time step it increases by either 0 or 1. This makes the mechanism computationally inefficient and not easily applicable in real life scenarios.
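The tree-based counting idea can be illustrated as follows (a simplified sketch in our own notation, not the exact construction of~\cite{chan2011private}). Let $x_1, x_2, \ldots$ be the input bits and let each tree node hold a noisy partial sum $\hat{\Sigma}[i,j] = \sum_{k=i}^{j} x_k + \mathrm{noise}$ over a dyadic interval of timestamps. The count at time $t$ is assembled from the nodes matching the binary representation of $t$; e.g.,~for $t = 7 = 4 + 2 + 1$,
\[
  \hat{c}_7 = \hat{\Sigma}[1,4] + \hat{\Sigma}[5,6] + \hat{\Sigma}[7,7] .
\]
Every count thus aggregates at most $\lfloor \log_2 t \rfloor + 1$ noisy nodes, instead of $t$ individually perturbed values, which is the intuition behind the poly-logarithmic error.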
% Differentially private real-time data release over infinite trajectory streams
\hypertarget{cao2015differentially}{Cao et al.}~\cite{cao2015differentially} developed a framework that achieves \emph{$l$-trajectory} protection and enables personalized user privacy, while dynamically adding noise at each timestamp that exponentially fades over time. The user can specify, in an array of size $l$, the desired protection level for each location of his/her trajectory. The proposed framework is composed of three components. As its name indicates, the \emph{Dynamic Budget Allocation} component allocates portions of the privacy budget to the other two components: a fixed one to the \emph{Private Approximation} component, and a dynamic one to the \emph{Private Publishing} component at each timestamp.
The \emph{Private Approximation} component estimates, under a utility goal and an approximation strategy, whether it is beneficial to publish approximate data or not. It chooses an appropriate previous noisy data release and republishes it, if it is similar to the real statistics planned to be published. The \emph{Private Publishing} component takes the real statistics and the timestamp of the approximate data as inputs, and releases noisy data using a differential privacy mechanism that adds Laplace noise. If the timestamp of the approximate data is equal to the current timestamp, then the current data with Laplace noise are published. Otherwise, the noisy data at the timestamp of the approximate data are republished. The utilized approximation technique is highly suitable for stream processing and can significantly reduce the privacy budget consumption. However, the framework does not take into account privacy leakage stemming from data correlations, a fact that considerably limits its applicability in real life.
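For completeness, the Laplace mechanism mentioned above has the following standard form (the symbols $f$, $\Delta f$, and $\varepsilon_t$ are our notation, with $\varepsilon_t$ standing for the budget assigned to timestamp $t$). For a query $f$ with $l_1$-sensitivity $\Delta f$,
\[
  M(D) = f(D) + \mathrm{Lap}\!\left( \frac{\Delta f}{\varepsilon_t} \right) .
\]
As a hypothetical example, a count query with $\Delta f = 1$ and $\varepsilon_t = 0.1$ receives noise of scale $10$, i.e.,~of standard deviation $\sqrt{2} \cdot 10 \approx 14.14$. Republishing an earlier noisy value instead spends no additional publishing budget at that timestamp, which is where the savings of the approximation step come from.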
% Private decayed predicate sums on streams
\hypertarget{bolot2013private}{Bolot et al.}~\cite{bolot2013private} introduce the notion of \emph{decayed privacy} in continual observation of aggregates (sums). The authors recognize the fact that monitoring applications focus more on recent events and data, and therefore, the value of previous data releases exponentially fades. This leads to a scheme of \emph{privacy with expiration}, according to which recent events and data are more privacy sensitive than those preceding them. Based on this, they apply \emph{decayed sum} functions for answering sliding window queries of fixed window size $w$ on data streams, namely, (i) the \emph{window} sum, which can be reduced to computing the difference of two running sums, and (ii) the \emph{exponentially decayed} and (iii) the \emph{polynomially decayed} sums, which estimate the sum of decayed data. For every $w$ consecutive data points, a binary tree is generated, where each node is perturbed by injecting Laplace noise with scale proportional to $w$. Instead of maintaining a binary tree for every window, the windows that span two blocks are viewed as the union of a suffix and a prefix of two consecutive trees. The proposed techniques are designed for fixed window sizes; hence, the available privacy budget must be split for answering multiple sliding window queries with various window sizes.
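The three decayed sums can be written as follows (a sketch; the decay parameters $\alpha \in (0,1)$ and $c > 0$ and the symbol $S_t$ are our notation and may differ from~\cite{bolot2013private}). For a stream $x_1, \ldots, x_t$ with running sum $S_t = \sum_{i=1}^{t} x_i$ and $t \geq w$:
\[
  \mathrm{window:} \quad \sum_{i=t-w+1}^{t} x_i = S_t - S_{t-w},
\]
\[
  \mathrm{exponential:} \quad \sum_{i=1}^{t} \alpha^{\,t-i} x_i, \qquad
  \mathrm{polynomial:} \quad \sum_{i=1}^{t} \frac{x_i}{(t-i+1)^{c}} .
\]
In all three cases the most recent item $x_t$ has weight $1$ and older items have equal or smaller weight, which matches the intuition that older data matter less.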
% PrivApprox: privacy-preserving stream analytics
\hypertarget{quoc2017privapprox}{Le Quoc et al.}~\cite{quoc2017privapprox} propose \emph{PrivApprox}, a data analytics system for privacy-preserving stream processing of distributed data sets that combines sampling and randomized response. Analysts' queries are distributed to clients via an aggregator and proxies. A randomized response is transmitted by the clients, who sample the locally available data, to the aggregator via proxies that apply (XOR-based) encryption. The combination of sampling and randomized response achieves \emph{zero-knowledge} based privacy, i.e.,~proving that one knows a piece of information without disclosing its actual value. The aggregator aggregates the received responses and returns statistics to the analysts. For numerical queries, responses are expressed as counts within histogram buckets, whereas for non-numeric queries, each bucket is specified by a matching rule or a regular expression. A confidence metric quantifies the approximation of the results that is due to the sampling and randomization. The system employs sliding window computations over batched stream processing to handle the data stream generated by the clients. \emph{PrivApprox} achieves low-latency stream processing and enables a synchronization-free distributed architecture that requires little trust in a central entity. However, the assumption that released data sets are independent is rarely true in real-life scenarios.
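To illustrate how an aggregator can undo the randomization, consider a common two-coin randomized response scheme (an assumption made here for the sake of the example; the parameters $p$ and $q$ and the exact scheme may differ from those of~\cite{quoc2017privapprox}). Each of the $N$ sampled clients answers truthfully with probability $p$, and otherwise answers ``yes'' with probability $q$. If $A_y$ is the true number of ``yes'' answers and $R_y$ the observed one, then $\mathrm{E}[R_y] = p\,A_y + (1-p)\,q\,N$, which gives the estimate
\[
  \hat{A}_y = \frac{R_y - (1-p)\,q\,N}{p} .
\]
For instance, with $N = 1000$, $p = q = 0.5$, and $R_y = 400$, the estimate is $\hat{A}_y = (400 - 250) / 0.5 = 300$.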
% Hiding in the crowd: Privacy preservation on evolving streams through correlation tracking
\hypertarget{li2007hiding}{Li et al.}~\cite{li2007hiding} attempt to tackle the problem of privacy preservation in data streams by continuously tracking data correlations. Firstly, the authors define utility and privacy. The utility of a perturbed data stream is the inverse of the \emph{discrepancy} between the original and perturbed measurements. The discrepancy is set as the normalized \emph{Frobenius} norm, i.e.,~a matrix norm defined as the square root of the sum of the absolute squares of its elements. Privacy is the discrepancy between the original and the reconstructed data stream (from the perturbed one), and is composed of the removed noise and the error introduced by the reconstruction. Then, correlations come into play. The data streams are continuously monitored for new tuples and trends to track correlations, and the system dynamically adds noise accordingly. More specifically, the \emph{Streaming Correlated Additive Noise} (SCAN) module is used to update the estimation of the local principal components of the original data and proportionally distribute noise along the components. Thereafter, the \emph{Streaming Correlation Online Reconstruction} (SCOR) module removes all the noise by utilizing the best linear reconstruction. Overall, the present technique offers robustness against inference attacks by adapting randomization according to data trends, but fails to quantify the overall privacy guarantee.
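For a concrete view of the discrepancy measure, the Frobenius norm of the difference between an original matrix $A$ and its perturbed version $A^{*}$ is (the symbols are our notation, and the normalization factor used by~\cite{li2007hiding} is deliberately left out of this sketch)
\[
  \| A - A^{*} \|_F = \sqrt{ \sum_{i} \sum_{j} | a_{ij} - a^{*}_{ij} |^2 } .
\]
As a toy example, if the element-wise differences are $1$, $2$, $2$, and $4$, then $\| A - A^{*} \|_F = \sqrt{1 + 4 + 4 + 16} = 5$. A larger value means lower utility when $A^{*}$ is the perturbed stream, and higher privacy when $A^{*}$ is the adversary's reconstruction.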
% PeGaSus: Data-Adaptive Differentially Private Stream Processing
\hypertarget{chen2017pegasus}{Chen et al.}~\cite{chen2017pegasus} developed \emph{PeGaSus}, an algorithm for event-level differentially private stream processing that supports different categories of stream queries (counts, sliding window, event monitoring) over multiple stream resolutions. It consists of a \emph{perturber}, a \emph{grouper}, and a \emph{smoother} module. The perturber consumes the incoming data stream, adds noise to each data item using part of the available privacy budget $\varepsilon$, and outputs a stream of noisy data. The data-adaptive grouper consumes the original stream and partitions the data into well-approximated regions, also using part of the available privacy budget. Finally, a query-specific smoother combines the independent information produced by the perturber and the grouper, and performs post-processing by calculating the final estimates of the perturber's values for each partition created by the grouper at each timestamp. The combination of the perturber and the grouper follows the sequential composition and post-processing properties of differential privacy; thus, the resulting algorithm satisfies $(\varepsilon_p + \varepsilon_g)$-differential privacy, with $\varepsilon_p + \varepsilon_g = \varepsilon$. Here, $\varepsilon_p$ is the privacy budget used by the perturber to add noise to the data and $\varepsilon_g$ is the corresponding budget used by the grouper to perturb the user-defined deviation threshold. Nonetheless, the algorithm does not take into account past and/or future releases, thus failing to capture any related privacy leakage.
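The budget accounting above relies on two standard properties of differential privacy, restated here in our own notation ($M_p$, $M_g$, and $g$ are names introduced for illustration). Sequential composition: if $M_p$ is $\varepsilon_p$-differentially private and $M_g$ is $\varepsilon_g$-differentially private on the same data, then releasing both outputs is $(\varepsilon_p + \varepsilon_g)$-differentially private. Post-processing: any function $g$ of these outputs that does not access the raw data preserves the guarantee. Hence,
\[
  g\bigl( M_p(D), M_g(D) \bigr) \ \mathrm{satisfies} \ \varepsilon\mathrm{-differential\ privacy}, \qquad \varepsilon = \varepsilon_p + \varepsilon_g .
\]
As a hypothetical split, a total budget of $\varepsilon = 1$ could be divided into $\varepsilon_p = 0.8$ and $\varepsilon_g = 0.2$; the smoother then consumes no budget of its own.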
% Quantifying Differential Privacy under Temporal Correlations
\hypertarget{cao2017quantifying}{Cao et al.}~\cite{cao2017quantifying} propose a method of computing the \emph{temporal privacy leakage} of a differential privacy mechanism in the presence of temporal correlations and background knowledge. The goal of this work is to achieve event-level privacy protection and bound privacy leakage at every single time point. The temporal privacy leakage is calculated as the sum of the \emph{backward} and \emph{forward privacy leakage} minus the privacy leakage of the mechanism, because the latter is counted twice in the former two quantities. The backward privacy leakage at any time depends on the backward privacy leakage at the previous time point, the temporal correlations, and the traditional privacy leakage of the privacy mechanism. The forward privacy leakage is calculated recursively, i.e.,~for every new time point all the previous time points are re-calculated, therefore increasing the privacy loss in the past. As intuition suggests, stronger correlations result in higher privacy leakage. However, the leakage is smaller when the dimension of the transition matrix (modeling the correlations) is larger, due to the fact that larger transition matrices tend to be uniform, resulting in weaker correlations.
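In symbols, and using abbreviations introduced here for illustration ($TPL_t$, $BPL_t$, and $FPL_t$ for the temporal, backward, and forward privacy leakage at time $t$, and $PL_t$ for the leakage of the mechanism itself at $t$), the relation described above reads
\[
  TPL_t = BPL_t + FPL_t - PL_t .
\]
With hypothetical values $PL_t = 0.1$, $BPL_t = 0.3$, and $FPL_t = 0.25$, the temporal leakage would be $TPL_t = 0.45$, i.e.,~considerably more than the nominal $0.1$ of the mechanism, which is the effect of the correlations that this work quantifies.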
% Differentially private event sequences over infinite streams
\hypertarget{kellaris2014differentially}{Kellaris et al.}~\cite{kellaris2014differentially} defined $w$-event privacy in the setting of periodical release of statistics (counts) in infinite streams. To achieve $w$-event privacy, the authors propose two mechanisms based on sliding windows, which effectively distribute the privacy budget to sub-mechanisms (one sub-mechanism per timestamp) applied on the data of a window of the stream. Both algorithms may decide whether or not to publish a new noisy count for a specific timestamp, based on the similarity level of the current count with a previously published one. Moreover, both algorithms have the constraint that the total privacy budget consumed in a window is less than or equal to $\varepsilon$. However, the first algorithm (Budget Distribution, BD) distributes the privacy budget in an exponentially fading manner, following the assumption that most of the counts in a window remain similar. The budget of expired timestamps becomes available for the next publications (of next windows). On the contrary, the second algorithm (Budget Absorption, BA) uniformly distributes the budget to the window's timestamps from the beginning. A publication uses not only the budget allocated by default but also the budget of non-published timestamps. In order not to exceed the limit of $\varepsilon$, an adequate number of subsequent timestamps are `silenced'.
%Both algorithms are applicable to real life scenarios including traffic and website visit data.
Even though one can argue that $w$-event privacy could be achieved through user-level privacy, doing so is impractical because the rigidity of the budget allocation would ultimately render the output useless.
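The budget constraint shared by both mechanisms can be stated compactly (a sketch in our notation, where $\varepsilon_k$ denotes the budget spent at timestamp $k$): for every timestamp $t$,
\[
  \sum_{k=t-w+1}^{t} \varepsilon_k \leq \varepsilon .
\]
As an illustrative example in the spirit of budget absorption, let $\varepsilon = 1$ and $w = 4$, so that each timestamp is initially assigned $\varepsilon / w = 0.25$. If one timestamp is skipped and the next publication absorbs its share, that publication spends $0.5$. Without silencing, a window containing it followed by three regular publications would consume $0.5 + 3 \cdot 0.25 = 1.25 > \varepsilon$; silencing the timestamp right after it brings the total back to $0.5 + 0 + 0.25 + 0.25 = 1$.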
% RescueDP: Real-time spatio-temporal crowd-sourced data publishing with differential privacy
\hypertarget{wang2016rescuedp}{Wang et al.}~\cite{wang2016rescuedp} work on the publication of real-time spatiotemporal user-generated data, utilizing differential privacy with a $w$-event guarantee. Initially, \emph{RescueDP} performs dynamic \emph{grouping} of regions with small statistics according to the data trends. Then, each group passes through a \emph{perturbation} module that injects Laplace noise. Due to the grouping of the previous phase, the perturbation error on small statistics can be reduced, increasing the utility of the resulting statistics. A \emph{budget allocation} module distributes the available privacy budget to sampling points within any $w$ successive timestamps, using an adaptive \emph{sampling} module that adjusts according to data dynamics. Non-sampled data are approximated with previously perturbed data, saving part of the available privacy budget. Finally, a \emph{Kalman filtering} module is used to improve the accuracy of the published data.
\subsection{Sequential data}
% Practical differential privacy via grouping and smoothing
\hypertarget{kellaris2013practical}{Kellaris et al.}~\cite{kellaris2013practical} pointed out that in time series, where users might contribute to an arbitrary number of aggregates, the sensitivity of the query answering function is significantly influenced by their presence/absence in the data set. Thus, the \emph{Laplace perturbation algorithm}, commonly used with differential privacy, may produce meaningless data sets. Furthermore, under such settings, the discrete Fourier transform of the \emph{Fourier perturbation algorithm} may behave erratically and affect the utility of the outcome of the mechanism. Hence, the authors proposed a method involving \emph{grouping} and \emph{smoothing} for one-time publishing of time series of \emph{non-overlapping} counts, i.e.,~each individual contributes to one count at a time. Grouping separates the data set into clusters of similar counts. The size and the similarity of the clusters are data dependent. Random grouping consumes less privacy budget, as there is minimum interaction with the original data. However, when using a grouping technique based on sampling, which has some privacy cost but produces better groups, the smoothing perturbation is decreased. During the smoothing phase, the average value of each cluster is calculated and, finally, Laplace noise is added. This way, the query sensitivity becomes less dependent on each individual's data and therefore, less perturbation is required.
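The benefit of smoothing can be seen in a simplified calculation (a sketch under our own assumptions, not the exact analysis of~\cite{kellaris2013practical}). Consider a group $G$ of $m$ counts $c_1, \ldots, c_m$ and assume that one individual changes the sum of the group by at most $1$. Each count of the group is then replaced by the noisy average
\[
  \bar{c}_G = \frac{1}{m} \left( \sum_{i=1}^{m} c_i + \mathrm{Lap}\!\left( \frac{1}{\varepsilon} \right) \right),
\]
so that the noise affecting each published value has scale $1 / (m \varepsilon)$ instead of $1 / \varepsilon$. With, e.g.,~$m = 10$ and $\varepsilon = 0.1$, the scale drops from $10$ to $1$. The price is the smoothing error $| c_i - \frac{1}{m} \sum_{j} c_j |$, which is small only when the grouped counts are similar; this is why the quality of the grouping matters.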
% Differentially private sequential data publication via variable-length n-grams
\hypertarget{chen2012differentially}{Chen et al.}~\cite{chen2012differentially} exploit a text-processing technique, the \emph{$n$-gram} model, i.e.,~a contiguous sequence of $n$ items from a given data sample, to retain information of a sequential data set without releasing the noisy counts of all possible sequences. This model allows publishing only the most common $n$-grams ($n$ is typically smaller than 5), from which the original data set can be accurately reconstructed. Privacy is enhanced by the fact that the universe of all grams with a small $n$ value is relatively small, resulting in more common sequences. Furthermore, utility is improved by the fact that, for small values of $n$, the corresponding counts are large enough to withstand noise injection and the inherent Markov assumption of the $n$-gram model. Variable-length $n$-grams are released with certain thresholds for the values of counts and tree heights, which allows dealing with the trade-off of shorter grams carrying less information than longer ones but having smaller relative error. Grams are grouped based on the similarity of their $n$ values, constructing a search tree. The process goes on until reaching the desired maximum $n$ value. Grams with smaller noisy counts have larger relative error and thus lower utility. Instead of allocating the available privacy budget based on the overall maximum height of the tree, the height of each path is adaptively estimated based on the known noisy counts. To further improve the final utility, consistency constraints are used, i.e.,~the sum of the children's noisy counts has to be less than or equal to their parent's noisy count, and the noisy counts of leaf nodes should be within a set threshold. The technique is designed for count query and frequent sequential pattern mining scenarios.
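The Markov assumption and the first consistency constraint can be written compactly; the notation ($x_i$ for the $i$-th item of a sequence, $\tilde{c}(v)$ for the noisy count of a tree node $v$, and $C(v)$ for the set of its children) is introduced here for illustration only. An $n$-gram model approximates the probability of the next item by conditioning on the last $n-1$ items,
\[
  P(x_{i+1} | x_1,\dots,x_i) \approx P(x_{i+1} | x_{i-n+2},\dots,x_i),
\]
and the consistency constraint on the tree requires
\[
  \sum_{u \in C(v)} \tilde{c}(u) \le \tilde{c}(v) \quad \mbox{for every internal node } v.
\]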
% Differentially private publication of general time-serial trajectory data
\hypertarget{hua2015differentially}{Hua et al.}~\cite{hua2015differentially} tackle the problem of trajectories containing a small number of $n$-grams and thus sharing few or even no identical prefixes. They propose a differentially private location generalization algorithm, based on the exponential mechanism, for trajectory publishing, where each position in the trajectory is one record. The algorithm probabilistically partitions the locations at each timestamp with regard to their Euclidean distance from each other. Each partition is replaced by its centroid and, therefore, locations belonging to closer trajectories are grouped together, resulting in better utility. The algorithm is optimized for time efficiency by using classic $k$-means clustering. Then, the algorithm releases the new trajectories over the generalized location partitions, along with their counts perturbed with noise drawn from a Laplace distribution. The process continues until the total count of the published trajectories reaches the size of the original data set. If the users' moving speed is taken into account, the total number of possible trajectories can be limited. The authors assessed the utility of distorted spatiotemporal range queries by measuring the Hausdorff distance from the original results, and concluded that the utility deterioration is within reasonable boundaries considering the offered privacy guarantees.
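For reference, the Hausdorff distance between two finite point sets $A$ and $B$, e.g.,~the original and the distorted results of a range query, is commonly defined as follows, where $d(\cdot,\cdot)$ is the underlying point distance (assumed here to be Euclidean):
\[
  d_H(A,B) = \max\Bigl\{ \max_{a \in A}\min_{b \in B} d(a,b),\; \max_{b \in B}\min_{a \in A} d(a,b) \Bigr\}.
\]
A small $d_H$ means that every point of one result lies close to some point of the other.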
% Achieving differential privacy of trajectory data publishing in participatory sensing
\hypertarget{li2017achieving}{Li et al.}~\cite{li2017achieving} focus on publishing a set of trajectories where, contrary to~\cite{hua2015differentially}, each one is considered as a single entry in the data set. First, the original locations are partitioned by using $k$-means clustering based on their pairwise Euclidean distances. Each location partition is represented by its mean (centroid). A larger number of partitions translates into fewer locations in each partition and thus a smaller loss of trajectory precision. Before adding noise to the trajectory counts, the original size of the database is approximated by randomly observing the generalized trajectories with the original ones. Then, by using a set of consistency constraints, bounded Laplace noise is generated and added to the count of each trajectory. Finally, the generalized trajectories as well as their noisy counts are released. Although this technique considerably reduces the trajectory merging time, the assumption that all trajectories in the data set are recorded at the same time points does not usually hold in real-life use cases.
\subsection{Time series}
% Privacy-utility trade-off under continual observation
\hypertarget{erdogdu2015privacy}{Erdogdu et al.}~\cite{erdogdu2015privacy} consider the scenario where users generate samples at every timestamp from a time series correlated with their sensitive data. The data that the users have chosen and are willing to share privately with a service provider are distorted according to a \emph{privacy mapping}, i.e.,~a stochastic process, and then samples are selected for release. A \emph{distortion metric} quantifies the discrepancy of the distorted data from the original. The authors investigate both a simple attack setting, where the adversary makes static assumptions based only on the observations so far, which cannot be altered later, and a more complex one, where the assumptions are affected dynamically by past and future data releases. In both cases, information leakage at a time point is quantified by a \emph{privacy metric} that measures the improvement of the adversarial inference after observing the data released at that particular point. The goal of the privacy mapping is to find a balance between the distortion and privacy metrics, i.e.,~to achieve maximum utility of the released data while preserving privacy. Throughout the process, both batch and streaming processing schemas are considered. In order to decrease the complexity of streaming processing, the authors propose the utilization of HMMs for modeling data dependencies. The assumption that users are privacy-conscious, and the fact that typical smart-meter system data include only the total power usage, can drastically limit the applicability of the technique described. Last but not least, there is no proof that the proposed technique is composable.
% Bayesian Differential Privacy on Correlated Data
\hypertarget{yang2015bayesian}{Yang et al.}~\cite{yang2015bayesian} show that privacy is poorer against an adversary who has the least prior knowledge. When data are correlated according to a Gaussian correlation model, the adversary with the least prior knowledge poses the highest risk of information leakage. This is because the expected variation of the query results is enhanced by the unknown tuples and the correlations with respect to different values of the private individual. However, correlations may sometimes be negative, in which case the weakest adversary may not correspond to the largest privacy leakage. Moreover, adversaries might have different correlation structures, since they could collect information from different sources. Therefore, it is necessary to consider the privacy of correlated data against arbitrary adversaries. To address this necessity, the authors extend the definition of differential privacy in a Bayesian way, and propose a new \emph{Pufferfish} privacy definition, called \emph{Bayesian differential privacy}, to express the level of private information leakage. Additionally, they design a general perturbation algorithm that guarantees privacy, taking into account prior knowledge of any subset of tuples in the data, when the data are correlated. Data correlations are transformed into a weighted network with an arbitrary topology structure, where the correlation strength is translated into a weight value. The larger the value of the weight, the more likely it is for two tuples to be close and thus correlated. These networks are described by a Gaussian Markov random field. A Gaussian correlation model is used to accurately describe the structure of data correlations and to analyze the Bayesian differential privacy of the perturbation algorithm on the basis of this model. This model is extended to a more general one by adding a prior distribution to each tuple, so that it forms a Gaussian joint distribution on all tuples.
The uncertain query answer is connected with the given tuples in a Bayesian way. The perturbation mechanism calculates the potential leakage for the strongest adversaries and applies noise proportional to the maximum privacy leakage coefficient. On the downside, the proposed solution is not suitable for applications that require online processing for real-time statistics.
% Pufferfish Privacy Mechanisms for Correlated Data
\hypertarget{song2017pufferfish}{Song et al.}~\cite{song2017pufferfish} propose the \emph{Wasserstein mechanism}, a technique that applies to any general instantiation of \emph{Pufferfish}. It adds noise proportional to the \emph{sensitivity} of a query $F$, which depends on the worst-case distance between the distributions $P(F(X)|s_i,d)$ and $P(F(X)|s_j,d)$ for a variable $X$, a pair of secrets $(s_i,s_j)$, and an evolution scenario $d$. The worst-case distance between those two distributions is calculated by the \emph{Wasserstein metric} function. The noise is drawn from a Laplace distribution with parameter equal to the maximum Wasserstein distance over the distributions of all the pairs of secrets, divided by the available privacy budget $\epsilon$. For optimization purposes, the authors consider a more restricted setting, where data correlations, represented by the evolution scenario $d$, are modeled by using \emph{Bayesian networks}. Dependencies are calculated by the \emph{Markov quilt mechanism}, a generalization of the \emph{Markov blanket mechanism}, where the dependent nodes of any node consist of its parents, its children, and the other parents of its children. The technique excels at data sets generated by monitoring applications or networks; however, it does not apply to online settings.
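In symbols, and following the description above, the noise calibration can be sketched as follows; we write $W_\infty$ for the worst-case (i.e.,~$\infty$-Wasserstein) distance, $\mathcal{S}$ for the set of secret pairs, and $M$ for the mechanism, all of which are notational assumptions made here for illustration:
\[
  W = \sup_{(s_i,s_j) \in \mathcal{S},\, d} W_\infty\bigl(P(F(X)|s_i,d),\, P(F(X)|s_j,d)\bigr), \qquad M(X) = F(X) + Z, \quad Z \sim \mathrm{Lap}\!\left(\frac{W}{\epsilon}\right).
\]
When no pair of secrets shifts the distribution of $F(X)$ by much, $W$ is small and correspondingly little noise is needed.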
% Differentially private multi-dimensional time series release for traffic monitoring
\hypertarget{fan2013differentially}{Fan et al.}~\cite{fan2013differentially} propose a real-time framework for releasing differentially private multi-dimensional traffic monitoring data. Data at every timestamp are injected with noise, drawn from a Laplace distribution, by the \emph{Perturbation} module. The perturbed data are post-processed by the \emph{Estimation} module to produce a more accurate released version. Domain knowledge, e.g.,~road network and density, is utilized by the \emph{Modeling/Aggregation} module in two ways. On the one hand, an internal time series model is estimated for each location to improve the utility of the perturbation outcome by performing a posterior estimation that utilizes \emph{Gaussian} approximation and \emph{Kalman} filtering. On the other hand, data sparsity is reduced by grouping neighboring locations based on a \emph{Quadtree}. All modules interact bidirectionally with each other. Although data correlations between timestamps are taken into account to improve the utility of the released data, the corresponding privacy leakage is not calculated. Furthermore, the adoption of sampling during the data processing could further improve the budget allocation procedure.
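As a hedged illustration of the posterior estimation step, the standard scalar Kalman correction can be written as follows; the symbols $\hat{x}_k^{-}$ (prior estimate, i.e.,~the model prediction), $z_k$ (perturbed value), $P_k^{-}$ (prior error variance), and $R$ (variance of the Gaussian approximation of the Laplace noise) are introduced here for illustration and are not taken from~\cite{fan2013differentially}:
\[
  K_k = \frac{P_k^{-}}{P_k^{-} + R}, \qquad \hat{x}_k = \hat{x}_k^{-} + K_k\,(z_k - \hat{x}_k^{-}).
\]
The larger the noise variance $R$, the smaller the gain $K_k$, and the more the released value $\hat{x}_k$ relies on the prediction rather than on the perturbed observation.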
% CTS-DP: publishing correlated time-series data via differential privacy
\hypertarget{wang2017cts}{Wang et al.}~\cite{wang2017cts} defined \emph{CTS-DP}, a correlated time-series data publication method based on differential privacy that enforces \emph{Series-Indistinguishability} and implements a \emph{correlated Laplace mechanism (CLM)}. \emph{CTS-DP} deals with the shortcomings of \emph{independent and identically distributed (IID)} noise. In the presence of correlations, IID noise offers inadequate protection, since it can be removed by applying refinement methods, e.g.,~filtering. Therefore, more noise must be introduced to make up for the amount that can be removed, which diminishes data utility. First, \emph{Series-Indistinguishability} is defined, which renders the statistical characteristics of the original and the noise series indistinguishable; based on this definition, the autocorrelation function of the noise series is derived. Second, a CLM passes four Gaussian white noise series through a linear system to produce a correlated Laplace noise series according to this autocorrelation function. However, the privacy leakage stemming from data correlations is not estimated.
% An Adaptive Approach to Real-Time Aggregate Monitoring With Differential Privacy
\hypertarget{fan2014adaptive}{Fan et al.} propose FAST~\cite{fan2014adaptive}, an adaptive system that allows the release of real-time aggregate time series under user-level differential privacy. This is achieved by using a \emph{sampling}, a \emph{perturbation}, and a \emph{filtering} module. The sampling module samples, at an adaptive rate, the aggregates to be perturbed. The perturbation module adds noise to each sampled point according to the allocated privacy budget. The filtering module receives the perturbed point together with a model-based prediction (prior estimate), and generates a posterior estimate, which is finally released. The error between the perturbed and the released (posterior estimate) point is used to adapt the sampling rate; the sampling frequency is increased when the data are going through rapid changes, and vice versa. Thus, depending on the adjusted sampling rate, not every single data point is perturbed, which saves part of the available privacy budget. Although temporal correlations of the processed time series are considered, the corresponding privacy leakage is not calculated.
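To make the budget saving concrete, consider a simplified sketch under assumptions made here for illustration: a total budget $\epsilon$, at most $M$ sampled points out of $T$ timestamps ($M \le T$), a uniform split of the budget, and aggregates of sensitivity $1$. Each sampled point $x_k$ is then perturbed as
\[
  \epsilon_s = \frac{\epsilon}{M}, \qquad z_k = x_k + \mathrm{Lap}\!\left(\frac{M}{\epsilon}\right),
\]
whereas perturbing all $T$ points under the same total budget would require the larger noise scale $T/\epsilon$ by sequential composition.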

View File

@ -1,59 +1,56 @@
\begin{titlepage}
\centering
% \includegraphics[width=0.15\textwidth]{example-image-1x1}\par\vspace{1cm}
{\LARGE Quality \& Privacy in User-generated Big Data:\par
Algorithms \& Techniques\par}
\vspace{2cm}
{\footnotesize PRESENTED ON September X, 2019
% PRÉSENTÉE LE X Septembre 2019
\par}
\vspace{0.5cm}
{\footnotesize AT THE FACULTY OF INFORMATION \& COMMUNICATION SCIENCES \& TECHNOLOGIES
% À LA FACULTÉ DES SCIENCES ET TECHNOLOGIES DE L'INFORMATION ET DE LA COMMUNICATION
\par}
{\footnotesize ETIS LAB
% LABORATOIRE ETIS
\par}
{\footnotesize DOCTORAL PROGRAMME IN SCIENCES \& TECHNOLOGIES
% PROGRAMME DOCTORAL EN SCIENCES ET INGÉNIERIE
\par}
\vspace{1cm}
{\large PARIS-SEINE UNIVERSITY
% UNIVERSITÉ PARIS-SEINE
\par}
\vspace{0.5cm}
{\footnotesize FOR THE DEGREE OF DOCTOR OF SCIENCES
% POUR L'OBTENTION DU GRADE DE DOCTEUR ÈS SCIENCES
\par}
\vspace{2cm}
{\footnotesize BY
% PAR
\par}
\vspace{0.5cm}
{\large Manos KATSOMALLOS\par}
\vfill
\centering
% \includegraphics[width=0.15\textwidth]{example-image-1x1}\par\vspace{1cm}
{\LARGE Quality \& Privacy in User-generated Big Data:\par
Algorithms \& Techniques\par}
\vspace{2cm}
{\footnotesize PRESENTED ON September X, 2019
% PRÉSENTÉE LE X Septembre 2019
\par}
\vspace{0.5cm}
{\footnotesize AT THE FACULTY OF INFORMATION \& COMMUNICATION SCIENCES \& TECHNOLOGIES
% À LA FACULTÉ DES SCIENCES ET TECHNOLOGIES DE L'INFORMATION ET DE LA COMMUNICATION
\par}
{\footnotesize ETIS LAB
% LABORATOIRE ETIS
\par}
{\footnotesize DOCTORAL PROGRAMME IN SCIENCES \& TECHNOLOGIES
% PROGRAMME DOCTORAL EN SCIENCES ET INGÉNIERIE
\par}
\vspace{1cm}
{\large PARIS-SEINE UNIVERSITY
% UNIVERSITÉ PARIS-SEINE
\par}
\vspace{0.5cm}
{\footnotesize FOR THE DEGREE OF DOCTOR OF SCIENCES
% POUR L'OBTENTION DU GRADE DE DOCTEUR ÈS SCIENCES
\par}
\vspace{2cm}
{\footnotesize BY
% PAR
\par}
\vspace{0.5cm}
{\large Manos KATSOMALLOS\par}
\vfill
{\footnotesize
accepted on the proposal of the jury:
% acceptée sur proposition du jury :
\par
\vspace{0.5cm}
Dr. ..., jury president
% président du jury
\par
Dr. ..., jury member\par
Dr. Dimitris Kotzinos, supervisor
% directeur de thèse
\par
Dr. Katerina (Aikaterini) Tzompanaki, co-supervisor
% co-directrice de thèse
\par
Dr. Vassilis Christophides, co-supervisor
% co-directeur de thèse
\par
Dr. ..., rapporteur\par}
\vfill
% Bottom of the page
\includegraphics[height=1.5cm]{logo-universite-paris-seine}\par
{\footnotesize France\par
2019}
accepted on the proposal of the jury:
% acceptée sur proposition du jury :
\par
\vspace{0.5cm}
Dr. ..., jury president
% président du jury
\par
Dr. ..., jury member\par
Dr. Dimitris Kotzinos, supervisor
% directeur de thèse
\par
Dr. Katerina (Aikaterini) Tzompanaki, co-supervisor
% co-directrice de thèse
\par
Dr. ..., rapporteur\par}
\vfill
% Bottom of the page
\includegraphics[height=1.5cm]{logo-universite-paris-seine}\par
{\footnotesize France\par
2019}
\end{titlepage}