\chapter{Introduction}
\label{ch:intro}
\nnfootnote{This chapter was presented during the $11$th International Workshop on Information Search, Integration, and Personalization~\cite{kotzinos2016data} and at the DaQuaTa International Workshop~\cite{kotzinos2017data}, as well as at the S{\~a}o Paulo School of Advanced Science on Smart Cities~\cite{katsomallos2016measuring}.}
Data privacy is becoming an increasingly important issue, both at a technical and at a societal level, and introduces various challenges ranging from the way we share and publish data sets to the way we use online and mobile services.
Personal information, also described as \emph{microdata}, has acquired increasing value and is in many cases used as the `currency'~\cite{economist2016data} to pay for access to various services, i.e.,~users are asked to exchange their personal information for the services provided.
This is particularly true for many \emph{Location-Based Services} (LBSs), e.g.,~Google Maps~\cite{gmaps}, Waze~\cite{waze}, etc.
These services provide their `free' service in exchange for collecting and using user-generated data, such as timestamped geolocated information.
Besides navigation and location-based services, social media applications, e.g.,~Facebook~\cite{facebook}, Twitter~\cite{twitter}, Foursquare~\cite{foursquare}, etc., take advantage of user-generated and user-related data to make relevant recommendations and show personalized advertisements.
In this case, location is also among the important personal data required to be shared.
Last but not least, \emph{data brokers}, e.g.,~Experian~\cite{experian}, TransUnion~\cite{transunion}, Acxiom~\cite{acxiom}, etc., collect data from public and private sources, e.g.,~censuses, bank card transaction records, voter registration lists, etc.
Most of these data are georeferenced and, directly or indirectly, contain location information; protecting the location of the user has thus become one of the most important privacy goals.
On the one hand, these different sources and types of data give useful feedback to the involved users and/or services; on the other hand, when combined, they provide valuable information to various internal/external analytical services.
While these activities happen within the boundaries of the law~\cite{tankard2016gdpr}, it is important to be able to protect the privacy of the corresponding data (by anonymizing, perturbing, encrypting, etc.) before sharing them, and to take into account the possibility of correlating, linking, and crossing diverse independent data sets.
The latter is becoming especially important in the era of Big Data, where the existence of diverse linked data sets is one of its promises; as an example, one can refer to the discussion on entity resolution problems using Linked Open Data in~\cite{efthymiou2015big}.
In some cases, personal data might be so distinctive that, even if de-identified, integrating them with a small amount of external data allows tracing them back to their original source.
An example case is shown in~\cite{de2013unique}, where it was discovered that four spatio-temporal points of a mobility trace are enough to identify $95\%$ of the individuals in a data set.
The case of location is actually one of great interest in this context, since space brings its own particular constraints.
The ability to combine and correlate additional information impacts the ways we protect sensitive data and affects the privacy guarantees we can provide.
Besides the explosion of online and mobile services, another important aspect is that a lot of these services actually rely on data provided by the users (\emph{crowdsourced} data) to function, with prominent examples being Wikipedia~\cite{wiki} and OpenStreetMap~\cite{osm}.
Data from crowdsourcing-based applications, if not protected correctly, can be easily used to identify personal information, such as location or activity, and thus lead indirectly to cases of user surveillance~\cite{lyon2014surveillance}.
Privacy-preserving processes usually introduce noise into the original or the aggregated data set in order to hide the sensitive information.
In the case of \emph{microdata}, a privacy-protected version, containing some synthetic data as well, is generated with the intrinsic goal of making the users indistinguishable.
In the case of \emph{statistical} data, i.e.,~the results of statistical queries over the original data sets, a privacy-protected version is generated by adding noise to the actual statistical values.
In both cases, we end up affecting the quality of the published data set.
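For instance, in the statistical case, a widely used approach is the Laplace mechanism of differential privacy; as an illustrative sketch (the notation here is ours and not tied to any particular technique discussed later), the result of a query $f$ over a data set $D$ is released as
\[
	\tilde{f}(D) = f(D) + \mathrm{Lap}\!\left(\frac{\Delta f}{\varepsilon}\right),
\]
where $\Delta f$ is the sensitivity of $f$, i.e.,~the maximum change of its result when a single individual is added to or removed from $D$, and $\varepsilon$ is the privacy budget.
The smaller the $\varepsilon$, i.e.,~the stronger the privacy guarantee, the larger the variance of the injected noise and, hence, the lower the utility of the published result.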
The privacy and the utility of the `noisy' output are two contrasting desiderata which need to be measured and balanced.
Furthermore, if we want to account for additional external information, e.g.,~linked or correlated data, and at the same time ensure the same level of protection, we need to add more noise, which inevitably deteriorates the quality of the output.
This problem becomes particularly pertinent in the Big Data era, where quality, or \emph{Veracity}, is one of the five dimensions (known as the five \emph{`V's'}) that define Big Data, and where there is an abundance of external information that cannot be ignored.
Since this needs to be taken into account \emph{prior} to the publishing of the data set or the aggregated statistics thereof, introducing external information into privacy-preserving techniques becomes part of the traditional processing flow, while keeping an acceptable quality-to-privacy ratio.
As we can observe in the examples mentioned above, there are many cases where data are not protected at the source (what is also described as \emph{local} data privacy protection) for various reasons, e.g.,~the users do not want to pay extra, it is impossible due to technical complexity, or the quality of the expected service would deteriorate.
Thus, the burden of the privacy-preserving process falls on the various aggregators of personal/private data, who should also provide the necessary technical solutions to ensure data privacy for every user (what is also described as \emph{global} data privacy protection).
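To make this distinction more concrete, consider---purely as an illustrative sketch, with notation of our own---a simple sum over the values $x_1, \dots, x_n$ of $n$ users.
Under \emph{local} protection, each user perturbs their own value before sharing it, e.g.,~reporting $\tilde{x}_i = x_i + \eta_i$, so that the aggregator only ever sees noisy values; under \emph{global} protection, the (trusted) aggregator collects the exact values and perturbs only the published result, e.g.,~releasing $\sum_{i=1}^{n} x_i + \eta$.
The former requires no trust in the aggregator but accumulates noise from every user, whereas the latter typically achieves better utility at the cost of entrusting the aggregator with the raw data.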
The discussion so far explains and justifies the current state of the privacy-preserving research area.
As a matter of fact, a wealth of algorithms has been proposed for privacy-preserving data publishing, either for microdata or for statistical data.
Moreover, privacy-preserving algorithms are designed specifically either for data published at a single point in time (used in what we call \emph{snapshot} data publishing) or for data released over, or concerning, a period of time (used in what we call \emph{continuous data publishing}).
In that respect, we need to be able to correctly choose the proper privacy algorithm(s), which would allow users to share protected copies of their data with some guarantees.
The selection process is far from trivial, since it is essential to:
\begin{enumerate}
\item select an appropriate privacy-preserving technique, relevant to the data set intended for public release;
\item understand the different requirements imposed by the selected technique and tune the different parameters according to the circumstances of the use case based on, e.g.,~assumptions, level of distortion, etc.~\cite{kifer2011no};
\item strike the necessary balance between privacy and data utility, which is a demanding task for any privacy algorithm as well as for any privacy expert.
\end{enumerate}
Selecting the wrong privacy algorithm or configuring it poorly may put the privacy of the involved individuals at risk and/or end up deteriorating the quality, and therefore the utility, of the data set.
\begin{figure}[htp]
\centering
\includegraphics[width=.75\linewidth]{introduction/data-value}
\caption{Value of data for decision-making over time from less than seconds to more than months~\cite{gualtieri2016perishable}.}
\label{fig:data-value}
\end{figure}
In data privacy research, privacy in continuous data publishing scenarios is the area concerned with the privacy problems that arise when sensitive data are published continuously, either indefinitely, e.g.,~streaming data, or through multiple publications over a known period of time, e.g.,~finite time series data.
This specific subfield of data privacy becomes increasingly important since it:
\begin{enumerate}[(i)]
\item includes the most prominent cases, e.g.,~location (trajectory) privacy problems, and
\item covers the most challenging and not yet well-charted part of the privacy algorithms, since it is rather new and increasingly complex.
\end{enumerate}
Additionally, data in continuous data publishing use cases require timely processing because their value usually decreases over time, depending on the use case, as demonstrated in Figure~\ref{fig:data-value}.
For this reason, we provide an insight into the time-related properties of the algorithms, e.g.,~whether they work on finite or infinite data, or whether they take into consideration any underlying data dependence.
The importance of continuous data publishing is underlined by the fact that many types of data commonly have such properties, with geospatial data being a prominent case.
A few examples include---but are not limited to---data being produced while tracking the movement of individuals for various purposes (where data might also need to be privacy-protected in real-time and in a continuous fashion); crowdsourced data that are used to report measurements, such as noise or pollution (where again we have a continuous timestamped and usually georeferenced stream of data); and even isolated data items that might include location information, such as photographs or social media posts.
Typically, in such cases, we have a collection of data referring to the same individual or set of individuals over a period of time, which can also be infinite.
Additionally, in many cases, the privacy-preserving processes should take into account implicit correlations and restrictions that exist, e.g.,~space-imposed collocation or movement restrictions.
Since these data are related to most of the important applications and services that enjoy high utilization rates, privacy-preserving continuous data publishing becomes one of the emblematic problems of our time.
\input{introduction/contribution}
\input{introduction/structure}