\chapter{Introduction}
\label{ch:intro}
\section{Introduction}
\label{sec:introduction}
Data privacy is becoming an increasingly important issue at both a technical and a societal level, and introduces various challenges ranging from the way we share and publish data sets to the way we use online and mobile services.
Personal information, also described as \emph{microdata}, has acquired increasing value and is in many cases used as the `currency'~\cite{economist2016data} with which users pay for access to various services, i.e.,~users are asked to exchange their personal information for the service provided.
This is particularly true for many \emph{Location Based Services (LBS)} like Google Maps~\cite{gmaps}, Waze~\cite{waze}, etc.; these services offer their `free' service in exchange for collecting and using user-generated data, such as timestamped geolocated information.
Besides navigation services, social media applications (e.g.,~Facebook~\cite{facebook}, Twitter~\cite{twitter}, Foursquare~\cite{foursquare}, etc.) take advantage of user-generated and user-related data to make relevant recommendations and show personalized advertisements.
Here, too, location is one of the most important pieces of private data that users are asked to share.
Last but not least, \emph{data brokers} (e.g.,~Experian~\cite{experian}, TransUnion~\cite{transunion}, Acxiom~\cite{acxiom}, etc.) collect data from public and private sources, e.g.,~censuses, bank card transaction records, voter registration lists, etc.
Most of these data are georeferenced and contain location information, directly or indirectly; consequently, protecting the location of the user has become one of the most important privacy goals.
These data, on the one hand, give useful feedback to the involved users and/or services, and on the other hand, provide valuable information to internal/external analysts.
While these activities happen within the boundaries of the law~\cite{tankard2016gdpr}, it is important to be able to anonymize the corresponding data before sharing and to take into account the possibility of correlating, linking and crossing diverse independent data sets.
The latter is becoming especially important in the era of Big Data, where the existence of diverse, interlinked data sets is one of its promises (see, for example, the discussion on Linked Open Data~\cite{efthymiou2015big}).
The ability to extract additional information impacts the ways we protect our data and affects the privacy guarantees we can provide.
Besides the explosion of online and mobile services, another important aspect is that many of these services actually rely on data provided by the users (\textit{crowdsourced} data) to function, with prominent examples including efforts like Wikipedia~\cite{wiki} and OpenStreetMap~\cite{osm}.
Data from crowdsourcing-based applications, if not protected correctly, can easily be used to infer personal information, like location or activity, and thus lead indirectly to issues of user surveillance~\cite{lyon2014surveillance}.
Nonetheless, users seem reluctant to undertake the financial burden of supporting the numerous services that they use~\cite{savage2013value}.
While this does not mean that the various aggregators of personal/private data are exonerated of any responsibility, it imposes the need to work within this model, providing the necessary technical solutions to increase data privacy.
\begin{figure}[tbp]
\centering
\includegraphics[width=\linewidth]{data-flow}
\caption{The usual flow of crowdsourced data harvested by publishers, anonymized, and released to data consumers.}
\label{fig:data-flow}
\end{figure}
However, providing adequate user privacy affects the utility of the data, a property associated with one of the five dimensions (known as the five \emph{`V's'}) that define Big Data: \textit{Veracity}.
In the case of \textit{microdata}, privacy-preserving processes generate a private version of the data set, possibly containing some synthetic data as well, in which individual users are not distinguishable.
In the case of \textit{statistical} data (e.g.,~the results of statistical queries over our data sets), a private version is generated by adding some kind of noise to the actual statistical values.
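For instance, a standard noise-inducing technique, the Laplace mechanism used to achieve $\varepsilon$-differential privacy~\cite{dwork2008differential}, perturbs the true answer of a statistical query $f$ over a data set $D$ with noise drawn from a Laplace distribution calibrated to the query's sensitivity $\Delta f$:
\[
  \tilde{f}(D) = f(D) + \mathrm{Lap}\!\left(\frac{\Delta f}{\varepsilon}\right),
\]
where smaller values of the privacy parameter $\varepsilon$ provide stronger privacy at the price of noisier, and thus less useful, answers.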
In both cases, we end up affecting the quality of the published data set; the privacy and the utility of the `noisy' private output are two contrasting desiderata that need to be measured and balanced.
As a matter of fact, when we consider external threats (e.g.,~linked or correlated data sets), more noise has to be added in order to ensure the same level of protection, inevitably further affecting the utility of the data set.
For this reason, the abundance of external information in the Big Data era is something that needs to be taken into account in the traditional processing flow, shown in Figure~\ref{fig:data-flow}.
While we still need to go through the preprocessing step to make the data private before releasing them for public use, we should make sure that the quality-to-privacy ratio is rebalanced accordingly.
This discussion highlights the importance of being able to correctly choose the proper privacy algorithms, which would allow users to provide private copies of their data with meaningful guarantees.
Finding a balance between privacy and data utility is a task far from trivial for any privacy expert.
On the one hand, it is crucial to select an appropriate anonymization technique, relevant to the data set intended for public release.
On the other hand, it is equally essential to tune the selected technique according to the circumstances, e.g.,~assumptions, level of distortion, etc.~\cite{kifer2011no}.
Selecting the wrong privacy algorithm, or configuring it poorly, may not only put the privacy of the involved individuals at risk, but also deteriorate the quality and, therefore, the utility of the data set.
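As a simple illustration of this tuning problem (assuming the Laplace mechanism sketched above and a counting query with sensitivity $\Delta f = 1$), setting $\varepsilon = 1$ adds noise with standard deviation $\sqrt{2}\,\Delta f/\varepsilon \approx 1.4$, which barely distorts the count, whereas setting $\varepsilon = 0.1$ inflates the standard deviation to approximately $14$, potentially rendering small counts meaningless; the former setting, of course, provides a correspondingly weaker privacy guarantee than the latter.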
\begin{figure}[tbp]
\centering
\includegraphics[width=0.5\linewidth]{data-value}
\caption{Value of data to decision making over time from less than seconds to more than months~\cite{gualtieri2016perishable}.}
\label{fig:data-value}
\end{figure}
In this context, this thesis focuses on privacy in continual publication scenarios, with an emphasis on works that take data correlations into account, since this field (i) includes the most prominent cases, such as location privacy problems, and (ii) constitutes the most challenging, and yet not well-charted, part of the privacy literature, as it is rather new and increasingly complex.
The data in these cases require timely processing, since their value usually decreases over time, as demonstrated in Figure~\ref{fig:data-value}.
This allows us to provide insight into additional properties of the algorithms, e.g.,~whether they work on streaming or real-time data, or whether they take into account data correlations, either within the data set or with external data sets.
Geospatial data commonly fall in this category; a few examples include --- but are not limited to --- data produced while tracking the movement of individuals for various purposes (where data should become private on the move and in real time), crowdsourced data used to report measurements like noise or pollution (where data should become private before reaching the server), and even data items like photographs or social media posts that might include location information (where data should become private before the posts become public).
In most of these cases, the privacy-preserving processes should take into account the implicit correlations that exist, since the data have a spatial dimension and space imposes its own restrictions.
The domain of data privacy is rather vast, and naturally several works have already been conducted, each with a different scope.
In the following, we refer the interested reader to a non-exhaustive list of relevant articles.
A group of works focuses on the family of algorithms used to make the data private.
For instance, Simi et al.~\cite{simi2017extensive} provide an extensive study of works on $k$-anonymity and Dwork~\cite{dwork2008differential} focuses on differential privacy.
Another group of works focuses on techniques that allow the execution of data mining or machine learning tasks with some privacy guarantees, e.g.,~Wang et al.~\cite{wang2009survey}, and Ji et al.~\cite{ji2014differential}.
In a more general scope, Wang et al.~\cite{wang2010privacy} offer a summary and evaluation of privacy-preserving data publishing techniques.
Additional works look into issues around Big Data and user privacy.
Indicatively, Jain et al.~\cite{jain2016big} and Soria-Comas et al.~\cite{soria2016big} examine how Big Data conflicts with preexisting concepts of private data management, and how efficiently $k$-anonymity and $\varepsilon$-differential privacy meet Big Data requirements.
Finally, there are some works that focus on specific application domains.
For example, Zhou et al.~\cite{zhou2008brief} focus on social networks, Christin et al.~\cite{christin2011survey} give an outline of how privacy aspects are addressed in crowdsensing applications, and Primault et al.~\cite{primault2018long} summarize location privacy threats and location privacy-preserving mechanisms.
% This thesis is organized as follows: we begin by providing a general description of the field of data privacy, and the most prominent anonymization and obfuscation/noise inducing algorithms that have been proposed in the literature so far (Section~\ref{sec:background}).
% The main content of the thesis (Section~\ref{sec:main}) spans works related to the continual publication of data points, or to the republication of (or parts of) a data set along time, with regard to the privacy of the individuals involved.
% More particularly, we divide the works in two categories, based on the type of data published: microdata or statistical data.
% In all cases, we use the same set of properties to characterize the algorithms, and thus, allow to compare them.
% Finally (Section~\ref{sec:conclusion}), we put these works into perspective and discuss various future research lines in this area.