From 5767899ecebf9a7078da9f9ad89ed580eeb96cb5 Mon Sep 17 00:00:00 2001
From: Manos
Date: Fri, 15 Oct 2021 09:01:14 +0200
Subject: [PATCH] introduction: Review

---
 text/introduction/main.tex | 22 ++++++++++------------
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/text/introduction/main.tex b/text/introduction/main.tex
index 5971fb0..0a3ce93 100644
--- a/text/introduction/main.tex
+++ b/text/introduction/main.tex
@@ -1,12 +1,13 @@
 \chapter{Introduction}
 \label{ch:intro}

-Data privacy is becoming an increasingly important issue both at a technical and at a societal level, and introduces various challenges ranging from the way we share and publish data sets to the way we use online and mobile services.
+Data privacy is becoming an increasingly important issue, both at a technical and at a societal level, and introduces various challenges ranging from the way we share and publish data sets to the way we use online and mobile services.
 Personal information, also described as \emph{microdata}, has acquired increasing value and is in many cases used as the `currency'~\cite{economist2016data} to pay for access to various services, i.e.,~users are asked to exchange their personal information for the service provided.
-This is particularly true for many \emph{Location-Based Services} (LBSs), e.g.,~Google Maps~\cite{gmaps}, Waze~\cite{waze}, etc.; these services exchange their `free' service with collecting and using user-generated data, such as timestamped geolocalized information.
-Besides navigation and location-based services, social media applications (e.g.,~Facebook~\cite{facebook}, Twitter~\cite{twitter}, Foursquare~\cite{foursquare}, etc.) take advantage of user-generated and user-related data, to make relevant recommendations and show personalized advertisement.
+This is particularly true for many \emph{Location-Based Services} (LBSs), e.g.,~Google Maps~\cite{gmaps}, Waze~\cite{waze}, etc.
+These services trade their `free' service for the collection and use of user-generated data, such as timestamped geolocalized information.
+Besides navigation and location-based services, social media applications, e.g.,~Facebook~\cite{facebook}, Twitter~\cite{twitter}, Foursquare~\cite{foursquare}, etc., take advantage of user-generated and user-related data to make relevant recommendations and show personalized advertisements.
 In this case, the location is also an important part of the personal data required to be shared.
-Last but not least, \emph{data brokers} (e.g.,~Experian~\cite{experian}, TransUnion~\cite{transunion}, Acxiom~\cite{acxiom}, etc.) collect data from public and private resources, e.g.,~censuses, bank card transaction records, voter registration lists, etc.
+Last but not least, \emph{data brokers}, e.g.,~Experian~\cite{experian}, TransUnion~\cite{transunion}, Acxiom~\cite{acxiom}, etc., collect data from public and private resources, e.g.,~censuses, bank card transaction records, voter registration lists, etc.
 Most of these data are georeferenced and directly or indirectly contain location information; protecting the location of the user has become one of the most important privacy goals so far.

 These different sources and types of data, on the one hand, give useful feedback to the involved users and/or services, and on the other hand, when combined together, provide valuable information to various internal/external analytical services.
@@ -21,11 +22,11 @@ Data from crowdsourced based applications, if not protected correctly, can be ea
 Privacy-preserving processes usually introduce noise in the original or the aggregated data set in order to hide the sensitive information.
 In the case of \emph{microdata}, a privacy-protected version, containing some synthetic data as well, is generated with the intrinsic goal of making the users indistinguishable.
-In the case of \emph{statistical} data (i.e.,~the results of statistical queries over the original data sets), a privacy-protected version is generated by adding noise on the actual statistical values.
+In the case of \emph{statistical} data, i.e.,~the results of statistical queries over the original data sets, a privacy-protected version is generated by adding noise to the actual statistical values.
 In both cases, we end up affecting the quality of the published data set.
-The privacy and the utility of the `noisy' output are two contrasting desiderata, which need to be measured and balanced.
-Furthermore, if we want to account for external additional information (e.g.,~linked or correlated data) and at the same time to ensure the same level of protection, we need to add additional noise, inevitably deteriorating the quality of the output.
-This problem becomes particularly pertinent in the Big Data era, as the quality or \emph{Veracity} is one of the five dimensions (known as the five \emph{`V's'}) that define Big Data, and where there is an abundance of external information that cannot be ignored.
+The privacy and the utility of the `noisy' output are two contrasting desiderata which need to be measured and balanced.
+Furthermore, if we want to account for additional external information, e.g.,~linked or correlated data, and at the same time ensure the same level of protection, we need to add additional noise, which inevitably deteriorates the quality of the output.
+This problem becomes particularly pertinent in the Big Data era, as the quality, or \emph{Veracity}, is one of the five dimensions (known as the five \emph{`V's'}) that define Big Data, and where there is an abundance of external information that cannot be ignored.
 Since this needs to be taken into account \emph{prior} to the publishing of the data set or the aggregated statistics thereof, introducing external information into privacy-preserving techniques becomes part of the traditional processing flow while keeping an acceptable quality-to-privacy ratio.

 As we can observe in the examples mentioned above, there are many cases where data are not protected at source (also described as \emph{local} data privacy protection) for various reasons, e.g.,~the users do not want to pay extra, it is impossible due to technical complexity, the quality of the expected service would deteriorate, etc.
@@ -37,7 +38,6 @@ Moreover, privacy-preserving algorithms are designed specifically for data publi
 In that respect, we need to be able to correctly choose the proper privacy algorithm(s), which would allow users to share protected copies of their data with some guarantees.
 The selection process is far from trivial, since it is essential to:
 \begin{enumerate}
-  \itemsep-0.25em
   \item select an appropriate privacy-preserving technique, relevant to the data set intended for public release;
   \item understand the different requirements imposed by the selected technique and tune the different parameters according to the circumstances of the use case based on, e.g.,~assumptions, level of distortion, etc.~\cite{kifer2011no};
   \item strike the necessary balance between privacy and data utility, which is a significant task for any privacy algorithm as well as any privacy expert.
@@ -51,15 +51,13 @@ Selecting the wrong privacy algorithm or configuring it poorly may put at risk t
 \label{fig:data-value}
 \end{figure}

-In data privacy research, privacy in continuous data publishing scenarios is the area that is concerned by studying the privacy problems created when sensitive data are published continuously, either infinitely (e.g.,~streaming data) or by multiple continuous publications over a known period of time (e.g.,~finite time series data).
+In data privacy research, privacy in continuous data publishing scenarios is the area concerned with studying the privacy problems created when sensitive data are published continuously, either infinitely, e.g.,~streaming data, or by multiple continuous publications over a known period of time, e.g.,~finite time series data.
 This specific subfield of data privacy becomes increasingly important since it:
 \begin{enumerate}[(i)]
-  \itemsep-0.25em
   \item includes the most prominent cases, e.g.,~location (trajectory) privacy problems, and
   \item provides the most challenging and yet not well-charted part of the privacy algorithms, since it is rather new and increasingly complex.
 \end{enumerate}
-In this context, we seek to offer a guide that would allow its users to choose the proper algorithm(s) for their specific use case accordingly.
 Additionally, data in continuous data publishing use cases require timely processing, because their value usually decreases over time depending on the use case, as demonstrated in Figure~\ref{fig:data-value}.
 For this reason, we provide an insight into the time-related properties of the algorithms, e.g.,~whether they work on infinite, real-time data, or whether they take existing data dependencies into consideration.
 The importance of continuous data publishing is stressed by the fact that, commonly, many types of data have such properties, with geospatial data being a prominent case.