introduction: Review
parent 85088d8047
commit 5767899ece
@@ -1,12 +1,13 @@
|
|||||||
\chapter{Introduction}
|
\chapter{Introduction}
|
||||||
\label{ch:intro}
|
\label{ch:intro}
|
||||||
|
|
||||||
Data privacy is becoming an increasingly important issue both at a technical and at a societal level, and introduces various challenges ranging from the way we share and publish data sets to the way we use online and mobile services.
|
Data privacy is becoming an increasingly important issue, both at a technical and at a societal level, and introduces various challenges ranging from the way we share and publish data sets to the way we use online and mobile services.
|
||||||
Personal information, also described as \emph{microdata}, has acquired increasing value and is in many cases used as the `currency'~\cite{economist2016data} to pay for access to various services, i.e.,~users are asked to exchange their personal information for the service provided.
|
Personal information, also described as \emph{microdata}, has acquired increasing value and is in many cases used as the `currency'~\cite{economist2016data} to pay for access to various services, i.e.,~users are asked to exchange their personal information for the service provided.
|
||||||
This is particularly true for many \emph{Location-Based Services} (LBSs), e.g.,~Google Maps~\cite{gmaps}, Waze~\cite{waze}, etc.; these services exchange their `free' service with collecting and using user-generated data, such as timestamped geolocalized information.
|
This is particularly true for many \emph{Location-Based Services} (LBSs), e.g.,~Google Maps~\cite{gmaps}, Waze~\cite{waze}, etc.
|
||||||
Besides navigation and location-based services, social media applications (e.g.,~Facebook~\cite{facebook}, Twitter~\cite{twitter}, Foursquare~\cite{foursquare}, etc.) take advantage of user-generated and user-related data, to make relevant recommendations and show personalized advertisement.
|
These services exchange their `free' service with collecting and using user-generated data, such as timestamped geolocalized information.
|
||||||
|
Besides navigation and location-based services, social media applications, e.g.,~Facebook~\cite{facebook}, Twitter~\cite{twitter}, Foursquare~\cite{foursquare}, etc., take advantage of user-generated and user-related data to make relevant recommendations and show personalized advertisements.
|
||||||
In this case, the location is also part of the important required personal data to be shared.
|
In this case, the location is also part of the important required personal data to be shared.
|
||||||
Last but not least, \emph{data brokers} (e.g.,~Experian~\cite{experian}, TransUnion~\cite{transunion}, Acxiom~\cite{acxiom}, etc.) collect data from public and private resources, e.g.,~censuses, bank card transaction records, voter registration lists, etc.
|
Last but not least, \emph{data brokers}, e.g.,~Experian~\cite{experian}, TransUnion~\cite{transunion}, Acxiom~\cite{acxiom}, etc., collect data from public and private resources, e.g.,~censuses, bank card transaction records, voter registration lists, etc.
|
||||||
Most of these data are georeferenced and contain directly or indirectly location information; protecting the location of the user has become one of the most important privacy goals so far.
|
Most of these data are georeferenced and contain directly or indirectly location information; protecting the location of the user has become one of the most important privacy goals so far.
|
||||||
|
|
||||||
These different sources and types of data, on the one hand give useful feedback to the involved users and/or services, and on the other hand, when combined together, provide valuable information to various internal/external analytical services.
|
These different sources and types of data, on the one hand give useful feedback to the involved users and/or services, and on the other hand, when combined together, provide valuable information to various internal/external analytical services.
|
||||||
@@ -21,11 +22,11 @@ Data from crowdsourced based applications, if not protected correctly, can be ea
|
|||||||
|
|
||||||
Privacy-preserving processes usually introduce noise in the original or the aggregated data set in order to hide the sensitive information.
|
Privacy-preserving processes usually introduce noise in the original or the aggregated data set in order to hide the sensitive information.
|
||||||
In the case of \emph{microdata}, a privacy-protected version, containing some synthetic data as well, is generated with the intrinsic goal to make the users indistinguishable.
|
In the case of \emph{microdata}, a privacy-protected version, containing some synthetic data as well, is generated with the intrinsic goal to make the users indistinguishable.
|
||||||
In the case of \emph{statistical} data (i.e.,~the results of statistical queries over the original data sets), a privacy-protected version is generated by adding noise on the actual statistical values.
|
In the case of \emph{statistical} data, i.e.,~the results of statistical queries over the original data sets, a privacy-protected version is generated by adding noise on the actual statistical values.
|
||||||
In both cases, we end up affecting the quality of the published data set.
|
In both cases, we end up affecting the quality of the published data set.
|
||||||
The privacy and the utility of the `noisy' output are two contrasting desiderata, which need to be measured and balanced.
|
The privacy and the utility of the `noisy' output are two contrasting desiderata which need to be measured and balanced.
|
||||||
Furthermore, if we want to account for external additional information (e.g.,~linked or correlated data) and at the same time to ensure the same level of protection, we need to add additional noise, inevitably deteriorating the quality of the output.
|
Furthermore, if we want to account for external additional information, e.g.,~linked or correlated data, and at the same time to ensure the same level of protection, we need to add additional noise, which inevitably deteriorates the quality of the output.
|
||||||
This problem becomes particularly pertinent in the Big Data era, as the quality or \emph{Veracity} is one of the five dimensions (known as the five \emph{`V's'}) that define Big Data, and where there is an abundance of external information that cannot be ignored.
|
This problem becomes particularly pertinent in the Big Data era, as the quality or \emph{Veracity} is one of the five dimensions (known as the five \emph{`V's'}) that define Big Data and where there is an abundance of external information that cannot be ignored.
|
||||||
Since this needs to be taken into account \emph{prior} to the publishing of the data set or the aggregated statistics thereof, introducing external information into privacy-preserving techniques becomes part of the traditional processing flow while keeping an acceptable quality-to-privacy ratio.
|
Since this needs to be taken into account \emph{prior} to the publishing of the data set or the aggregated statistics thereof, introducing external information into privacy-preserving techniques becomes part of the traditional processing flow while keeping an acceptable quality-to-privacy ratio.
|
||||||
|
|
||||||
As we can observe in the examples mentioned above, there are many cases where data are not protected at source (what is also described as \emph{local} data privacy protection) for various reasons, e.g.,~the users do not want to pay extra, it is impossible due to technical complexity, because the quality of the expected service will be deteriorated, etc.
|
As we can observe in the examples mentioned above, there are many cases where data are not protected at source (what is also described as \emph{local} data privacy protection) for various reasons, e.g.,~the users do not want to pay extra, it is impossible due to technical complexity, because the quality of the expected service will be deteriorated, etc.
|
||||||
@@ -37,7 +38,6 @@ Moreover, privacy-preserving algorithms are designed specifically for data publi
|
|||||||
In that respect, we need to be able to correctly choose the proper privacy algorithm(s), which would allow users to share protected copies of their data with some guarantees.
|
In that respect, we need to be able to correctly choose the proper privacy algorithm(s), which would allow users to share protected copies of their data with some guarantees.
|
||||||
The selection process is far from trivial, since it is essential to:
|
The selection process is far from trivial, since it is essential to:
|
||||||
\begin{enumerate}
|
\begin{enumerate}
|
||||||
\itemsep-0.25em
|
|
||||||
\item select an appropriate privacy-preserving technique, relevant to the data set intended for public release;
|
\item select an appropriate privacy-preserving technique, relevant to the data set intended for public release;
|
||||||
\item understand the different requirements imposed by the selected technique and tune the different parameters according to the circumstances of the use case based on, e.g.,~assumptions, level of distortion, etc.~\cite{kifer2011no};
|
\item understand the different requirements imposed by the selected technique and tune the different parameters according to the circumstances of the use case based on, e.g.,~assumptions, level of distortion, etc.~\cite{kifer2011no};
|
||||||
\item get the necessary balance between privacy and data utility, which is a significant task for any privacy algorithm as well as any privacy expert.
|
\item get the necessary balance between privacy and data utility, which is a significant task for any privacy algorithm as well as any privacy expert.
|
||||||
@@ -51,15 +51,13 @@ Selecting the wrong privacy algorithm or configuring it poorly may put at risk t
|
|||||||
\label{fig:data-value}
|
\label{fig:data-value}
|
||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
||||||
In data privacy research, privacy in continuous data publishing scenarios is the area concerned with studying the privacy problems created when sensitive data are published continuously, either infinitely (e.g.,~streaming data) or by multiple continuous publications over a known period of time (e.g.,~finite time series data).
|
In data privacy research, privacy in continuous data publishing scenarios is the area concerned with studying the privacy problems created when sensitive data are published continuously, either infinitely, e.g.,~streaming data, or by multiple continuous publications over a known period of time, e.g.,~finite time series data.
|
||||||
This specific subfield of data privacy becomes increasingly important since it:
|
This specific subfield of data privacy becomes increasingly important since it:
|
||||||
\begin{enumerate}[(i)]
|
\begin{enumerate}[(i)]
|
||||||
\itemsep-0.25em
|
|
||||||
\item includes the most prominent cases, e.g.,~location (trajectory) privacy problems, and
|
\item includes the most prominent cases, e.g.,~location (trajectory) privacy problems, and
|
||||||
\item provides the most challenging and yet not well charted part of the privacy algorithms since it is rather new and increasingly complex.
|
\item provides the most challenging and yet not well charted part of the privacy algorithms since it is rather new and increasingly complex.
|
||||||
\end{enumerate}
|
\end{enumerate}
|
||||||
|
|
||||||
In this context, we seek to offer a guide that would allow its users to choose the proper algorithm(s) for their specific use case accordingly.
|
|
||||||
Additionally, data in continuous data publishing use cases require a timely processing because their value usually decreases over time depending on the use case as demonstrated in Figure~\ref{fig:data-value}.
|
Additionally, data in continuous data publishing use cases require a timely processing because their value usually decreases over time depending on the use case as demonstrated in Figure~\ref{fig:data-value}.
|
||||||
For this reason, we provide an insight into time-related properties of the algorithms, e.g.,~if they work on infinite, real-time data, or if they take into consideration existing data dependencies.
|
For this reason, we provide an insight into time-related properties of the algorithms, e.g.,~if they work on infinite, real-time data, or if they take into consideration existing data dependencies.
|
||||||
The importance of continuous data publishing is stressed by the fact that, commonly, many types of data have such properties, with geospatial data being a prominent case.
|
The importance of continuous data publishing is stressed by the fact that, commonly, many types of data have such properties, with geospatial data being a prominent case.
|
||||||
|