Put text in separate directory
26
text/abstract.tex
Normal file
@@ -0,0 +1,26 @@
|
||||
\chapter{Abstract}
|
||||
\label{ch:abs}
|
||||
|
||||
Sensors, portable devices, and location-based services generate massive amounts of geo-tagged and/or location- and user-related data on a daily basis.
|
||||
The manipulation of such data is useful in numerous application domains, e.g.,~healthcare, intelligent buildings, and traffic monitoring, to name a few.
|
||||
A high percentage of these data carry information about users' activities and other personal details, and thus their manipulation and sharing raise concerns about the privacy of the individuals involved.
|
||||
To enable data sharing that is secure from the users' privacy perspective, researchers have proposed various seminal techniques for the protection of users' privacy.
|
||||
However, the continuous fashion in which data are generated nowadays, and the high availability of external sources of information, pose more threats and add extra challenges to the problem.
|
||||
|
||||
% Survey
|
||||
In the first part, we review the work done on data privacy for continuous data publishing, and report on the proposed solutions, with a special focus on solutions concerning location or geo-referenced data.
|
||||
As a matter of fact, a wealth of algorithms has been proposed for privacy-preserving data publishing, for either microdata or statistical data.
|
||||
In this context, this part seeks to offer a guide that allows its users to choose the proper algorithm(s) for their specific use case.
|
||||
We provide insight into time-related properties of the algorithms, e.g.,~whether they work on infinite, real-time data, or whether they take into consideration existing data dependencies.
|
||||
|
||||
|
||||
% Landmarks
|
||||
In the second part, we argue that in continuous data publishing, events are not equally significant in terms of privacy, and hence they should affect the privacy-preserving processing.
|
||||
Differential privacy is a well-established paradigm in privacy-preserving time series publishing.
|
||||
Different schemes exist, protecting either a single timestamp, or all the data per user or per window in the time series; however, they all consider every timestamp as equally significant.
|
||||
In this work, we propose a novel configurable privacy scheme, \emph{\thething} privacy, which takes into account significant events (\emph{\thethings}) in the time series and allocates the available privacy budget accordingly.
|
||||
We design two privacy models that guarantee {\thething} privacy and validate our proposal on real and synthetic data sets.
|
||||
|
||||
|
||||
\paragraph{Keywords:}
|
||||
information privacy, continuous data publishing, crowdsensing, privacy-preserving data processing
|
11
text/acknowledgements.tex
Normal file
@@ -0,0 +1,11 @@
|
||||
\chapter{Acknowledgements}
|
||||
\label{ch:ack}
|
||||
|
||||
Upon the completion of my thesis, I would like to express my deep gratitude to my research supervisors for their patient guidance, enthusiastic encouragement and useful critiques of this research work.
|
||||
Besides my advisors, I would like to thank the reviewers of my thesis as well as the rest of the jury for their invaluable contribution.
|
||||
|
||||
A special thanks to my department’s faculty, staff and fellow students for their valuable assistance whenever needed and for creating a pleasant and creative environment during my studies.
|
||||
|
||||
Last but not least, I wish to thank my family and friends for their unconditional support and encouragement all these years.
|
||||
|
||||
Cergy-Pontoise, MM DD, 2019
|
1913
text/bibliography.bib
Normal file
File diff suppressed because it is too large
10
text/conclusion.tex
Normal file
@@ -0,0 +1,10 @@
|
||||
\chapter{Conclusion and future work}
|
||||
\label{ch:con}
|
||||
|
||||
|
||||
\section{Thesis summary}
|
||||
\label{sec:sum-thesis}
|
||||
|
||||
|
||||
\section{Perspectives}
|
||||
\label{sec:persp}
|
77
text/introduction.tex
Normal file
@@ -0,0 +1,77 @@
|
||||
\chapter{Introduction}
|
||||
\label{ch:intro}
|
||||
|
||||
Data privacy is becoming an increasingly important issue both at a technical and at a societal level, and introduces various challenges ranging from the way we share and publish data sets to the way we use online and mobile services.
|
||||
Personal information, also described as \emph{microdata}, has acquired increasing value and is in many cases used as the `currency'~\cite{economist2016data} to pay for access to various services, i.e.,~users are asked to exchange their personal information for the service provided.
|
||||
This is particularly true for many \emph{Location-Based Services} (LBSs), e.g.,~Google Maps~\cite{gmaps}, Waze~\cite{waze}, etc.; these services offer their `free' service in exchange for collecting and using user-generated data, such as timestamped geolocalized information.
|
||||
Besides navigation and location-based services, social media applications (e.g.,~Facebook~\cite{facebook}, Twitter~\cite{twitter}, Foursquare~\cite{foursquare}, etc.) take advantage of user-generated and user-related data to make relevant recommendations and show personalized advertisements.
|
||||
In this case too, location is an important part of the personal data required to be shared.
|
||||
Last but not least, \emph{data brokers} (e.g.,~Experian~\cite{experian}, TransUnion~\cite{transunion}, Acxiom~\cite{acxiom}, etc.) collect data from public and private resources, e.g.,~censuses, bank card transaction records, voter registration lists, etc.
|
||||
Most of these data are georeferenced and, directly or indirectly, contain location information; protecting the location of the user has thus become one of the most important privacy goals.
|
||||
|
||||
These different sources and types of data, on the one hand, give useful feedback to the involved users and/or services, and, on the other hand, when combined together, provide valuable information to various internal/external analytical services.
|
||||
While these activities happen within the boundaries of the law~\cite{tankard2016gdpr}, it is important to be able to protect the privacy of the corresponding data before sharing them (by anonymizing, perturbing, encrypting, etc.), and to take into account the possibility of correlating, linking, and crossing diverse independent data sets.
|
||||
The latter is becoming especially important in the era of Big Data, where the existence of diverse linked data sets is one of the promises; as an example, one can refer to the discussion on Entity Resolution problems using Linked Open Data in~\cite{efthymiou2015big}.
|
||||
In some cases, personal data might be so representative that, even if de-identified, their integration with a small amount of external data allows tracing them back to their original source.
|
||||
An example case is shown in~\cite{de2013unique}, where it was discovered that four mobility traces are enough to identify $95\%$ of the individuals in a data set.
|
||||
The case of location is actually one of great interest in this context, since space brings its own particular constraints.
|
||||
The ability to combine and correlate additional information impacts the ways we protect sensitive data and affects the privacy guarantees we can provide.
|
||||
Besides the explosion of online and mobile services, another important aspect is that a lot of these services actually rely on data provided by the users (\emph{crowdsourced} data) to function, with prominent example efforts being Wikipedia~\cite{wiki}, and OpenStreetMap~\cite{osm}.
|
||||
Data from crowdsourcing-based applications, if not protected correctly, can be easily used to identify personal information, such as location or activity, and thus lead indirectly to cases of user surveillance~\cite{lyon2014surveillance}.
|
||||
|
||||
Privacy-preserving processes usually introduce noise in the original or the aggregated data set in order to hide the sensitive information.
|
||||
In the case of \emph{microdata}, a privacy-protected version, containing some synthetic data as well, is generated with the intrinsic goal to make the users indistinguishable.
|
||||
In the case of \emph{statistical} data (i.e.,~the results of statistical queries over the original data sets), a privacy-protected version is generated by adding noise to the actual statistical values.
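To give a concrete, textbook example (not tied to any specific technique discussed later), the widely used Laplace mechanism of differential privacy releases, instead of the true answer of a statistical query $f$ over a data set $D$, the perturbed value
\[
\tilde{f}(D) = f(D) + \mathrm{Lap}\!\left(\frac{\Delta f}{\varepsilon}\right) ,
\]
where $\Delta f$ is the sensitivity of the query (the maximum change in $f$ when the data of one individual are added or removed) and $\varepsilon$ controls the privacy level: the smaller the $\varepsilon$, the more the noise and the stronger the protection.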
|
||||
In both cases, we end up affecting the quality of the published data set.
|
||||
The privacy and the utility of the `noisy' output are two contrasting desiderata, which need to be measured and balanced.
|
||||
Furthermore, if we want to account for additional external information (e.g.,~linked or correlated data) and at the same time ensure the same level of protection, we need to add more noise, inevitably deteriorating the quality of the output.
|
||||
This problem becomes particularly pertinent in the Big Data era, as the quality or \emph{Veracity} is one of the five dimensions (known as the five \emph{`V's'}) that define Big Data, and where there is an abundance of external information that cannot be ignored.
|
||||
Since this needs to be taken into account \emph{prior} to the publishing of the data set or the aggregated statistics thereof, introducing external information into privacy-preserving techniques becomes part of the traditional processing flow, while keeping an acceptable quality-to-privacy ratio.
|
||||
|
||||
As we can observe in the examples mentioned above, there are many cases where data are not protected at the source (protection at the source is also described as \emph{local} data privacy protection) for various reasons, e.g.,~the users do not want to pay extra, the technical complexity makes it impossible, or the quality of the expected service would deteriorate.
|
||||
Thus, the burden of the privacy-preserving process falls on the various aggregators of personal/private data, who should also provide the necessary technical solutions to ensure data privacy for every user (what is also described as \emph{global} data privacy protection).
|
||||
|
||||
The discussion so far explains and justifies the current situation in the privacy-preserving scientific area.
|
||||
As a matter of fact, a wealth of algorithms has been proposed for privacy-preserving data publishing, for either microdata or statistical data.
|
||||
Moreover, privacy-preserving algorithms are designed specifically for data published at one point in time (used in what we call \emph{snapshot} data publishing) or data released over or concerning a period of time (used in what we call \emph{continuous data publishing}).
|
||||
In that respect, we need to be able to correctly choose the proper privacy algorithm(s), which would allow users to share protected copies of their data with some guarantees.
|
||||
The selection process is far from trivial, since it is essential to:
|
||||
\begin{enumerate}
|
||||
\itemsep-0.25em
|
||||
\item select an appropriate privacy-preserving technique, relevant to the data set intended for public release;
|
||||
\item understand the different requirements imposed by the selected technique and tune the different parameters according to the circumstances of the use case based on, e.g.,~assumptions, level of distortion, etc.~\cite{kifer2011no};
|
||||
\item get the necessary balance between privacy and data utility, which is a significant task for any privacy algorithm as well as any privacy expert.
|
||||
\end{enumerate}
|
||||
Selecting the wrong privacy algorithm or configuring it poorly may put at risk the privacy of the involved individuals and/or end up deteriorating the quality and therefore the utility of the data set.
|
||||
|
||||
\begin{figure}[htp]
|
||||
\centering
|
||||
\includegraphics[width=.5\linewidth]{data-value}
|
||||
\caption{Value of data for decision-making over time from less than seconds to more than months~\cite{gualtieri2016perishable}.}
|
||||
\label{fig:data-value}
|
||||
\end{figure}
|
||||
|
||||
In data privacy research, privacy in continuous data publishing scenarios is the area concerned with studying the privacy problems that arise when sensitive data are published continuously, either infinitely (e.g.,~streaming data) or in multiple continuous publications over a known period of time (e.g.,~finite time series data).
|
||||
This specific subfield of data privacy becomes increasingly important since it:
|
||||
\begin{enumerate}[(i)]
|
||||
\itemsep-0.25em
|
||||
\item includes the most prominent cases, e.g.,~location (trajectory) privacy problems, and
|
||||
\item provides the most challenging and yet not well-charted part of the privacy algorithms, since it is rather new and increasingly complex.
|
||||
\end{enumerate}
|
||||
|
||||
In this context, we seek to offer a guide that allows its users to choose the proper algorithm(s) for their specific use case.
|
||||
Additionally, data in continuous data publishing use cases require timely processing, because their value usually decreases over time depending on the use case, as illustrated in Figure~\ref{fig:data-value}.
|
||||
For this reason, we provide insight into time-related properties of the algorithms, e.g.,~whether they work on infinite, real-time data, or whether they take into consideration existing data dependencies.
|
||||
The importance of continuous data publishing is stressed by the fact that, commonly, many types of data have such properties, with geospatial data being a prominent case.
|
||||
A few examples include---but are not limited to---data being produced while tracking the movement of individuals for various purposes (where data might also need to be privacy-protected in real-time and in a continuous fashion); crowdsourced data that are used to report measurements, such as noise or pollution (where again we have a continuous timestamped and usually georeferenced stream of data); and even isolated data items that might include location information, such as photographs or social media posts.
|
||||
Typically, in such cases, we have a collection of data referring to the same individual or set of individuals over a period of time, which can also be infinite.
|
||||
Additionally, in many cases, the privacy-preserving processes should take into account implicit correlations and restrictions that exist, e.g.,~space-imposed collocation or movement restrictions.
|
||||
Since these data are related to most of the important applications and services that enjoy high utilization rates, privacy-preserving continuous data publishing becomes one of the emblematic problems of our time.
|
||||
|
||||
|
||||
\section{Contributions}
|
||||
\label{sec:contr}
|
||||
|
||||
|
||||
\section{Structure}
|
||||
\label{sec:struct}
|
91
text/main.tex
Normal file
@@ -0,0 +1,91 @@
|
||||
\documentclass[
|
||||
10pt,
|
||||
a4paper,
|
||||
chapterprefix=on,
|
||||
oneside
|
||||
]{scrbook}
|
||||
|
||||
\usepackage{adjustbox}
|
||||
\usepackage[ruled,lined,noend,linesnumbered]{algorithm2e}
|
||||
\usepackage{amsmath}
|
||||
\usepackage{amssymb}
|
||||
\usepackage[french, english]{babel}
|
||||
\usepackage{booktabs}
|
||||
\usepackage{caption}
|
||||
\usepackage[noadjust]{cite}
|
||||
\usepackage{enumerate}
|
||||
\usepackage{float}
|
||||
\usepackage[T1]{fontenc}
|
||||
\usepackage[a4paper]{geometry}
|
||||
\usepackage{graphicx}
|
||||
\PassOptionsToPackage{hyphens}{url}
|
||||
\usepackage{hyperref}
|
||||
\usepackage[utf8]{inputenc}
|
||||
\usepackage{multirow}
|
||||
\usepackage{stmaryrd}
|
||||
\usepackage{subcaption}
% Assumed missing: titlecaps provides the \titlecap command used by \Thething below; drop this line if the package is loaded elsewhere.
\usepackage{titlecaps}
|
||||
\usepackage[normalem]{ulem}
|
||||
\usepackage[table]{xcolor}
|
||||
\usepackage{arydshln}
|
||||
|
||||
% Deal with overfull lines.
|
||||
% https://texfaq.org/FAQ-overfull
|
||||
\setlength{\emergencystretch}{3em}
|
||||
\hbadness=99999
|
||||
|
||||
\graphicspath{{../graphics/}}
|
||||
\DeclareGraphicsExtensions{.pdf, .png}
|
||||
|
||||
\definecolor{material_red}{HTML}{d50000}
|
||||
\definecolor{material_green}{HTML}{00c853}
|
||||
\definecolor{material_blue}{HTML}{2962ff}
|
||||
|
||||
\newcommand{\kat}[1]{\noindent\textcolor{material_red}{\textbf{Katerina:} #1}}
|
||||
\newcommand{\mk}[1]{\noindent\textcolor{material_green}{\textbf{Manos:} #1}}
|
||||
\newcommand{\dk}[1]{\noindent\textcolor{material_blue}{\textbf{Dimitris:} #1}}
|
||||
|
||||
\newcommand{\thetitle}{Quality \& Privacy in User-generated Big Data: Algorithms \& Techniques}
|
||||
|
||||
\newcommand{\thething}{landmark}
|
||||
\newcommand{\Thething}{\titlecap{\thething}}
|
||||
\newcommand{\thethings}{\thething s}
|
||||
\newcommand{\Thethings}{\Thething s}
|
||||
|
||||
\newcommand*\includetable[1]{\input{../tables/#1.tex}}
|
||||
|
||||
\newtheorem{definition}{Definition}
|
||||
\newtheorem{example}{Example}[section]
|
||||
\newtheorem{proposition}{Proposition}
|
||||
\newtheorem{theorem}{Theorem}
|
||||
\newtheorem{corollary}{Corollary}[theorem]
|
||||
|
||||
\begin{document}
|
||||
|
||||
\input{titlepage}
|
||||
|
||||
\frontmatter
|
||||
|
||||
\input{abstract}
|
||||
\input{acknowledgements}
|
||||
|
||||
\tableofcontents
|
||||
|
||||
\listofalgorithms
|
||||
\listoffigures
|
||||
\listoftables
|
||||
|
||||
\mainmatter
|
||||
|
||||
% \nocite{*}
|
||||
|
||||
\input{introduction}
|
||||
\input{preliminaries}
|
||||
\input{related}
|
||||
\input{conclusion}
|
||||
|
||||
\backmatter
|
||||
|
||||
\bibliographystyle{alpha}
|
||||
\bibliography{bibliography}
|
||||
|
||||
\end{document}
|
418
text/micro.tex
Normal file
@@ -0,0 +1,418 @@
|
||||
\section{Microdata}
|
||||
\label{sec:micro}
|
||||
|
||||
As observed in Table~\ref{tab:micro}, privacy-preserving algorithms for microdata rely mostly on $k$-anonymity or derivatives of it.
|
||||
Ganta et al.~\cite{ganta2008composition} revealed that $k$-anonymity methods are vulnerable to complementary release attacks (or \emph{composition attacks} in the original publication).
|
||||
Consequently, the research community proposed solutions based on $k$-anonymity, focusing on different threats linked to continuous publication, as we review later on.
|
||||
However, notice that only a couple~\cite{li2016hybrid,shmueli2015privacy}
|
||||
of the following works assume that data sets are privacy-protected \emph{independently} of one another, meaning that the publisher is oblivious of the rest of the publications.
|
||||
On the other side, algorithms that are based on differential privacy are not concerned with such specific attacks since, by definition, differential privacy considers that the adversary may possess any kind of background knowledge.
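As a brief reminder of the guarantee that these works build upon (standard definition, restated here for convenience), a mechanism $\mathcal{M}$ satisfies $\varepsilon$-differential privacy if, for every pair of data sets $D$, $D'$ differing in the data of a single individual, and for every set $S$ of possible outputs,
\[
\Pr[\mathcal{M}(D) \in S] \leq e^{\varepsilon} \Pr[\mathcal{M}(D') \in S] .
\]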
|
||||
Later on, data dependencies were also considered in differential privacy algorithms, in order to account for the extra privacy loss that they entail.
|
||||
|
||||
\includetable{table-micro}
|
||||
|
||||
|
||||
\subsection{Finite observation}
|
||||
\label{subsec:micro-finite}
|
||||
|
||||
% Anonymizing sequential releases
|
||||
% - microdata
|
||||
% - finite (sequential)
|
||||
% - batch
|
||||
% - complementary release (form the quasi-identifiers from joining releases)
|
||||
% - user
|
||||
% - k-anonymity
|
||||
% - generalization + specialisation
|
||||
\hypertarget{wang2006anonymizing}{Wang and Fung}~\cite{wang2006anonymizing} address the problem of anonymously releasing different projections (i.e.,~subsets of the attributes) of the same data set in subsequent timestamps.
|
||||
More precisely, the authors want to protect individual information that could be revealed from joining various releases of the same data set.
|
||||
To do so, instead of locating the quasi-identifiers in a single release, the authors suggest that the identifiers may span the current and all previous releases of the (projections of the) data set.
|
||||
Then, the proposed method uses the join of the different releases on the common identifying attributes.
|
||||
The goal is to generalize the identifying attributes of the current release, given that previous releases are immutable.
|
||||
The generalization is performed in a top-down manner, meaning that the attributes are initially over-generalized, and are then specialized step by step until predefined quality and privacy requirements are met.
|
||||
The privacy requirement is the so-called \emph{($X$, $Y$)-privacy} for a threshold $k$, meaning that the identifying attributes in $X$ are linked with at most $k$ sensitive values in $Y$, in the join of the previously released and current data sets.
|
||||
The quality requirement can be tuned within the framework.
|
||||
Namely, the authors propose three alternatives: the reduction of the class entropy~\cite{quinlan2014c4, shannon2001mathematical}, the notion of distortion, and the discernibility~\cite{bayardo2005data}.
|
||||
The anonymization algorithm for releasing a data set in the existence of a previously released data set takes into account the scalability and performance problems that a join among those two may entail.
|
||||
Still, when many previous releases exist, the complexity would remain high.
|
||||
|
||||
% Anonymity for continuous data publishing
|
||||
% - microdata
|
||||
% - finite (incremental)
|
||||
% - batch
|
||||
% - complementary release (tuple correspondance attack)
|
||||
% - user
|
||||
% - k-anonymity
|
||||
% - generalization + specialization
|
||||
\hypertarget{fung2008anonymity}{Fung et al.}~\cite{fung2008anonymity} introduce the problem of privately releasing continuous incremental data sets.
|
||||
As a reminder, the invariant of this kind of releases is that at every timestamp $t_i$, the records previously released at $t_j$ ($j < i$) are released again together with a set of new records.
|
||||
The authors first focus on two consecutive releases and describe three classes of possible attacks, which fall under the general category of complementary attacks.
|
||||
They name these attacks \emph{correspondence attacks} because they rely on the principle that all tuples from an original data set $D_1$, from timestamp $t_1$, correspond to a tuple in the data set $D_2$, from timestamp $t_2$.
|
||||
Naturally, the opposite does not hold, as tuples added at $t_2$ do not exist in $D_1$.
|
||||
Assuming that the attacker knows the quasi-identifiers and the timestamp of the record of a person, they define the \emph{backward}, \emph{cross}, and \emph{forward} (\emph{BCF}) attacks.
|
||||
They show that combining two individually $k$-anonymized subsequent releases using one of the aforementioned attacks can lead to `cracking' some of the records in the set of $k$ candidate tuples, rendering the privacy level lower than $k$.
|
||||
Besides detecting cases that compromise BCF anonymity between two releases, the authors also provide an anonymization algorithm for a release $\pmb{o}_2$ in the presence of a private release $\pmb{o}_1$.
|
||||
The algorithm starts from the most generalized state possible for the quasi-identifiers of the records in $D_2$.
|
||||
Step by step, it checks which combinations of specializations on the attributes do not violate BCF anonymity, and outputs the most specialized version of the data set possible.
|
||||
The authors discuss how the framework extends to multiple releases and to different kinds of privacy methods (other than $k$-anonymity).
|
||||
It is worth noting that to maintain a certain quality for a release, it is essential that the delta among subsequent releases is large enough; otherwise the needed generalization level may destroy the utility of the data set.
|
||||
|
||||
% K anonymity for trajectories with spatial distortion
|
||||
% - microdata
|
||||
% - finite (sequential)(trajectories)
|
||||
% - batch
|
||||
% - complementary release
|
||||
% - user
|
||||
% - clustering & k-anonymity
|
||||
% - distortion (on the centroid)
|
||||
\hypertarget{abul2008never}{Abul et al.}~\cite{abul2008never} defined \emph{($k$, $\delta$)-anonymity} to enable the publishing of high-quality moving-object data sets.
|
||||
The authors claim that the classical $k$-anonymity framework cannot be directly applied to such kind of data from a data-centric perspective.
|
||||
The traditional distortion techniques in $k$-anonymity, i.e.,~generalization or suppression, yield great loss of information.
|
||||
On the one hand, suppression diminishes the size of the database.
|
||||
On the other hand, generalization demands the existence of quasi-identifiers, the values of which are going to be generalized.
|
||||
In trajectories, however, all points can be equally considered as quasi-identifiers.
|
||||
Obviously, a generalization of all the trajectory points would yield great levels of distortion.
|
||||
For this reason, a new, spatial-based distortion method is proposed.
|
||||
After clustering the trajectories in groups of at least $k$ elements, each trajectory is translated into a new one, in a vicinity of a predefined threshold $\delta$.
|
||||
Of course, the newly generated trajectories should still form a $k$-anonymous set.
|
||||
The authors validate their theory by experimentally showing that the difference in the results of count queries executed over a data set and over its ($k$, $\delta$)-anonymized version remains low.
|
||||
However, a comparative evaluation against existing clustering techniques, e.g.,~$k$-means, would have been interesting, to better support the contributions of this part of the solution as well.
|
||||
|
||||
% Privacy-utility trade-off under continual observation
|
||||
% - microdata
|
||||
% - finite
|
||||
% - batch/streaming
|
||||
% - dependence
|
||||
% - user
|
||||
% - perturbation (randomization)
|
||||
% - temporal correlations (HMM)
|
||||
% - local
|
||||
\hypertarget{erdogdu2015privacy}{Erdogdu and Fawaz}~\cite{erdogdu2015privacy} consider the scenario where privacy-conscious individuals separate the data that they generate into sensitive, and non-sensitive.
|
||||
The individuals keep the former unreleased, and publish samples of the latter to a service provider.
|
||||
Privacy mapping, implemented as a stochastic process, distorts the non-sensitive data samples locally, and a separable distortion metric (e.g.,~Hamming distance) calculates the discrepancy of the distorted data from the original.
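For intuition (our notation, following the standard meaning of a separable metric), the distortion between an original sequence $x_1, \dots, x_n$ and its distorted counterpart $y_1, \dots, y_n$ decomposes into per-sample terms:
\[
d(x_1, \dots, x_n; y_1, \dots, y_n) = \sum_{i = 1}^{n} d(x_i, y_i) ,
\]
where, for the Hamming distance, $d(x_i, y_i)$ equals $1$ if $x_i \neq y_i$ and $0$ otherwise.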
|
||||
The goal of the privacy mapping is to find a balance between the distortion and privacy metric, i.e.,~achieve maximum released data utility, while offering sufficient privacy guarantees.
|
||||
The authors assume that there is a data dependence (modeled with an HMM) between the two data sets, and thus the release of the distorted data set can reveal information about the sensitive one.
|
||||
They investigate both a simple attack setting, and a complex one.
|
||||
In the simple attack, the adversary can make only static assumptions, based on the observations made so far, which cannot be altered later.
|
||||
In the complex attack, past and future data releases dynamically affect the assumptions that an adversarial entity makes.
|
||||
In both cases, the framework quantifies the information leakage at any time point using a privacy metric that measures the improvement of the adversarial inference of the sensitive data set, which the individual kept secret, after observing the data released at that particular point.
|
||||
Throughout the process, the authors consider both the batch, and the streaming processing schemes.
|
||||
However, the assumption that individuals are privacy-conscious can drastically limit the applicability of the framework.
|
||||
Furthermore, the metrics that the framework utilizes for the evaluation of the privacy guarantees that it provides are not intuitive.
|
||||
|
||||
% M-invariance: towards privacy preserving re-publication of dynamic data sets
|
||||
% - microdata
|
||||
% - finite
|
||||
% - batch
|
||||
% - complementary release (intersection of sensitive values)
|
||||
% - user
|
||||
% - k-anonymity
|
||||
% - generalization + synthetic data insertion
|
||||
\hypertarget{xiao2007m}{Xiao et al.}~\cite{xiao2007m} consider the case where a data set is (re)published at different timestamps in an update (tuple insert/delete) manner.
|
||||
More precisely, they address data anonymization in continuous publishing by implementing $m$-\emph{invariance}.
|
||||
In a simple $k$-anonymity (or $l$-diversity) scenario, the privacy of an individual existing in two updates can be compromised by intersecting the sets of sensitive values.
|
||||
In contrast, an individual who exists in a series of $m$-invariant releases is always associated with the same set of $m$ different sensitive values.
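To illustrate (hypothetical values):
\begin{example}
Assume $m = 2$ and an individual whose record appears in the releases at timestamps $t_1$ and $t_2$.
If her equivalence class at $t_1$ is associated with the sensitive values \{flu, cancer\}, then $2$-invariance requires that her class at $t_2$ is associated with exactly the same set \{flu, cancer\}; hence, intersecting the two releases does not narrow down her sensitive value.
\end{example}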
|
||||
To enable the publishing of $m$-invariant data sets, artificial tuples (\emph{counterfeits}) may be added in a release.
|
||||
To minimize the noise added to the data sets, the authors provide an algorithm with two extra desiderata: limit the counterfeits, and minimize the quasi-identifiers' generalization level.
|
||||
Still, the choice of adding tuples with specific sensitive values disturbs the value distribution with a direct effect on any relevant statistics analysis.
|
||||
|
||||
% Preventing equivalence attacks in updated, anonymized data
|
||||
% - microdata
|
||||
% - finite
|
||||
% - batch
|
||||
% - complementary release (equivalence attack)
|
||||
% - user
|
||||
% - m-invariance (k-anonymity)
|
||||
% - generalization + synthetic data insertion
|
||||
In the same update setting (insert/delete tuple), \hypertarget{he2011preventing}{He et al.}~\cite{he2011preventing} introduce another kind of attack, namely the \emph{equivalence} attack, not taken into account by the aforementioned $m$-invariance technique.
|
||||
The equivalence attack allows for sets of individuals to be considered equivalent as far as the sensitive attribute is concerned, in different timestamps.
|
||||
In this way, all the members of the equivalence class will be harmed, if the sensitive value is learned even for only one member.
|
||||
For a number of releases to be private, they have to be both $m$-invariant and $e$-equivalent ($e < m$).
|
||||
The authors propose an algorithm incorporating $m$-invariance, based on the graph optimization \emph{min cut} problem, for publishing $e$-equivalent data sets.
|
||||
The proposed method can achieve better levels of privacy, in comparable times and quality as $m$-invariance.
|
||||
|
||||
% Privacy by diversity in sequential releases of databases
|
||||
% - microdata
|
||||
% - finite (sequential)
|
||||
% - batch
|
||||
% - complementary release (unknown previous releases )
|
||||
% - user
|
||||
% - l-diversity
|
||||
% - generalization + permutation of sensitive information among tuples with the same quasi-identifiers
|
||||
\hypertarget{Shmueli}{Shmueli and Tassa}~\cite{shmueli2015privacy} identified the computational inefficiency of anonymously releasing a data set, taking into account previous ones, in scenarios of continuous data publishing.
|
||||
The released data sets contain subsets of attributes of an original data set, while the authors propose an extension for attribute addition.
|
||||
Their algorithm can compute $l$-diverse anonymized releases (over different subsets of attributes) in parallel by generating $l - 1$ so-called \emph{fake} worlds.
|
||||
A fake world is generated from the base data set by randomly permuting non-identifier and sensitive values among the tuples, in such a way that minimal information loss (quality desideratum) is incurred.
|
||||
This is partially accomplished by verifying that the permutation is done among quasi-identifiers that are similar.
|
||||
Then, the algorithm creates buckets of tuples with at least $l$ different sensitive values, in which the quasi-identifiers will then be generalized in order to achieve $l$-diversity (privacy protection desideratum).
|
||||
The generalization step is also conducted in an information-loss efficient way.
|
||||
All different releases will be $l$-diverse because they are created assuming the same possible worlds, with which they are consistent.
|
||||
Tuples/attributes deletion is briefly discussed and left as an open question.
|
||||
The article is contrasted with a previous work~\cite{shmueli2012limiting} of the same authors, claiming that the new approach considers a stronger adversary (the adversary knows all individuals with their quasi-identifiers in the data set, and not only one), and that the computation is much more efficient, as it does not have an exponential complexity with respect to the number of previous publications.
|
||||
|
||||
% A hybrid approach to prevent composition attacks for independent data releases
|
||||
% - microdata
|
||||
% - finite
|
||||
% - batch
|
||||
% - complementary release (releases unknown to the publisher)
|
||||
% - user
|
||||
% - k-anonymity
|
||||
% - generalization + noise (from normal distribution)
|
||||
\hypertarget{li2016hybrid}{Li et al.}~\cite{li2016hybrid} identified a common characteristic in most of the privacy techniques: when anonymizing a data set all previous releases are known to the data publisher.
|
||||
However, it is probable that the releases are independent from each other, and that the data publisher is unaware of these releases when anonymizing the data set.
|
||||
In such a setting, the previous techniques would suffer from composition attacks.
|
||||
The authors define this kind of adversary and propose a hybrid model for data anonymization.
|
||||
More precisely, the publisher/adversary knows that an individual exists in two different anonymized versions of the same data set and has hold of the anonymized versions, but the anonymization is done independently (i.e.,~without considering the previously anonymized data sets) for each data set.
|
||||
The key idea in fighting a composition attack is to increase the probability that the matches among tuples from the two data sets are random, i.e.,~that they link different individuals rather than the same one.
|
||||
To do so, the proposed privacy protection method exploits three preprocessing steps before applying a traditional $k$-anonymity or $l$-diversity algorithm.
|
||||
First, the data set is sampled so as to blur the knowledge of the existence of individuals.
|
||||
Then, especially in small data sets, quasi-identifiers are distorted by noise addition before the classical generalization step.
|
||||
The noise is taken from a normal distribution with the mean and standard deviation values calculated on the corresponding quasi-identifier values.
|
||||
In the case of sparse data, the sensitive values are generalized along with the quasi-identifiers.
|
||||
The danger of composition attacks is less prominent when using this method on top of $k$-anonymity rather than without, while having comparable quality results.
|
||||
The authors also provide a comparison to data set release using $\varepsilon$-differential privacy, demonstrating that their techniques are superior with respect to quality, because in the competing algorithm the noise is added for each of the sensitive attributes to be protected.
|
||||
Even though the authors use two different values of $\varepsilon$ in the experiments, a better experiment would have been to compare the quality/privacy ratio of the two methods.
|
||||
This is a good attempt to independently anonymize multiple times the same data set; nevertheless, the scenario is restricted to releases over the same database schema, using the same perturbation, and generalization functions.
|
||||
|
||||
% Publishing trajectories with differential privacy guarantees
|
||||
% - microdata (trajectories)
|
||||
% - finite
|
||||
% - batch
|
||||
% - linkage
|
||||
% - event
|
||||
% - differential privacy
|
||||
% - perturbation (Laplace)
|
||||
% - Seems to belong to the local scheme but in the scenario/evaluation they release multiple trajectories.
|
||||
\hypertarget{jiang2013publishing}{Jiang et al.}~\cite{jiang2013publishing} focus on ship trajectories with known starting and terminal points.
|
||||
More specifically, they study different noise addition mechanisms for publishing trajectories with differential privacy guarantees.
|
||||
These mechanisms include adding global noise to the trajectory, and local noise to either each location point or the coordinates of each point of the trajectory.
|
||||
The first two mechanisms sample noisy radius from an exponential distribution, while the latter adds noise drawn from a Laplace distribution to each coordinate of every location.
|
||||
By comparing these different techniques, they conclude that the latter offers a better privacy guarantee and a smaller error bound.
|
||||
Nonetheless, the resulting trajectory is noticeably distorted due to the addition of Laplace noise to the original coordinates.
|
||||
To tackle this issue, they design the \emph{Sampling Distance and Direction} (SDD) mechanism.
|
||||
This mechanism publishes the optimal next possible trajectory point by sampling a suitable distance and direction at the current position from the probability distribution of the exponential mechanism, while taking into account the ship's maximum speed constraint.
|
||||
Due to the fact that SDD utilizes the exponential mechanism, it outperforms the other three mechanisms, and maintains a good utility-privacy balance.
|
||||
|
||||
% Differentially private trajectory data publication
|
||||
% - microdata (trajectories)
|
||||
% - finite
|
||||
% - batch
|
||||
% - linkage
|
||||
% - user
|
||||
% - differential privacy
|
||||
% - perturbation (Laplace)
|
||||
\hypertarget{chen2011differentially}{Chen et al.}~\cite{chen2011differentially} propose a non-interactive data-dependent privacy-preserving algorithm to generate a differentially private release of trajectory data.
|
||||
The algorithm relies on a noisy prefix tree, i.e.,~an ordered search tree data structure used to store an associative array.
|
||||
Each node represents a location, from a set of possible locations that any user can be present at, of a trajectory and contains a perturbed count, which represents the number of individuals at the current location, with noise drawn from a Laplace distribution.
|
||||
The privacy budget is equally allocated to each level of the tree representing a timestamp.
|
||||
At each level, and for every node, the algorithm seeks the children nodes with a non-zero number of trajectories (non-empty nodes) in order to continue expanding them.
|
||||
An empty node has a noisy count lower than a threshold that is dependent on the available privacy budget and the height of the tree.
|
||||
All children nodes are associated with disjoint data subsets, and thus the algorithm can utilize, for every node, all of the available budget at every tree level, according to the parallel composition theorem of differential privacy.
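For completeness, we recall the parallel composition property invoked here (standard result, our notation): if each mechanism $\mathcal{M}_i$ satisfies $\varepsilon_i$-differential privacy and the mechanisms are applied to pairwise disjoint subsets of the data, then their combination satisfies $\varepsilon$-differential privacy with
\[
\varepsilon = \max_{i} \varepsilon_i ;
\]
hence, sibling subtrees can each consume the full budget allocated to their tree level.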
|
||||
To generate the anonymized database, it is necessary to traverse the prefix tree once in post-order, paying attention to terminating (empty) nodes.
|
||||
During this process, taking into account some consistency constraints helps to avoid erroneous trajectories due to the noise injection.
|
||||
Namely, each node of a path should have a count that is greater than or equal to the count of each of its children, and greater than the sum of the counts of all of its children.
|
||||
Increasing the privacy budget results in less average relative error because less noise is added at each level, and thus improves quality.
|
||||
By increasing the height of the tree, the relative error initially decreases as more information is retained from the database.
|
||||
However, after a certain threshold, the increase of height can result in less available privacy budget at each level, and thus more relative error due to the increased perturbation.
|
||||
|
||||
% Protecting Locations with Differential Privacy under Temporal Correlations
|
||||
% - microdata (trajectories)
|
||||
% - finite
|
||||
% - batch
|
||||
% - dependence
|
||||
% - user?
|
||||
% - $\delta$-location set (differential privacy)
|
||||
% - perturbation (Laplace / Planar Isotropic Mechanism (PIM))
|
||||
% - temporal correlations (Markov)
|
||||
% - local
|
||||
\hypertarget{xiao2015protecting}{Xiao et al.}~\cite{xiao2015protecting} propose another privacy definition based on differential privacy that accounts for temporal correlations in geo-tagged data.
|
||||
Location transitions between two consecutive timestamps are determined by temporal correlations modeled through a Markov chain.
|
||||
A \emph{$\delta$-location} set includes all the probable locations a user might appear at, excluding locations of low probability.
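Following the prose above (notation ours, informal), the $\delta$-location set at timestamp $t$ can be read as the smallest set of candidate locations that covers all but a $\delta$ fraction of the probability mass of the user's possible positions:
\[
\Delta X_t = \min \left\{ s_i \;\middle|\; \sum_{s_i} p_t[s_i] \geq 1 - \delta \right\} ,
\]
where $p_t[s_i]$ denotes the probability, derived from the Markov model, that the user is at location $s_i$ at timestamp $t$.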
|
||||
Therefore, the true location is hidden in the resulting set, in which any pair of locations are indistinguishable.
|
||||
The lower the value of $\delta$, the more locations are included and hence, the higher the level of privacy that is achieved.
|
||||
The authors use the \emph{Planar Isotropic Mechanism} (PIM) as perturbation mechanism, which they designed upon their proof that $l_1$-norm sensitivity fails to capture the exact sensitivity in a multidimensional space.
|
||||
For this reason, PIM utilizes instead the \emph{sensitivity hull}, a notion that is independent of the location privacy context.
|
||||
In~\cite{xiao2017loclok}, the authors demonstrate the functionality of their system \emph{LocLok}, which implements the concept of $\delta$-location.
|
||||
|
||||
% Time distortion anonymization for the publication of mobility data with high utility
|
||||
% - microdata (trajectory)
|
||||
% - finite
|
||||
% - batch
|
||||
% - linkage
|
||||
% - event
|
||||
% - temporal transformation
|
||||
% - perturbation
|
||||
% - local
|
||||
\hypertarget{primault2015time}{Primault et al.}~\cite{primault2015time} proposed \emph{Promesse}, an algorithm that builds on time distortion instead of location distortion when releasing trajectories.
|
||||
Promesse takes as input an individual's mobility trace, comprising a data set of pairs of geolocations and timestamps, and a parameter $\varepsilon$.
|
||||
The latter indicates the desired distance between the location points that will be publicly released.
|
||||
Initially, Promesse extracts regularly spaced locations, and interpolates each one of the locations at a distance depending on the previous location and the value of $\varepsilon$.
|
||||
Then, it removes the first and last locations of the mobility trace, and assigns uniformly distributed timestamps to the remaining locations of the trajectory.
|
||||
Hence, the resulting trace has a smooth speed, and therefore places where the individual stayed longer, e.g.,~home, work, etc., are indistinguishable.
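In other words (our notation), if the released trace spans a time window of length $T$ and $n$ locations are retained at spatial spacing $\varepsilon$, each retained location is assigned a timestamp at intervals of $T / (n - 1)$, so the apparent speed is constant,
\[
v = \frac{\varepsilon \, (n - 1)}{T} ,
\]
and stops can no longer be told apart from movement by their dwell time.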
|
||||
The algorithm needs to know the starting and ending point of the trajectory; thus, it can only apply to offline scenarios.
|
||||
Furthermore, it works better with fine-grained data sets, because in this way it can achieve optimal geolocation and timestamp pairing.
|
||||
Moreover, the definition of $\varepsilon$ cannot provide versatile privacy protection since it is data dependent.
|
||||
|
||||
% Differentially Private and Utility Preserving Publication of Trajectory Data
|
||||
% - microdata (trajectory)
|
||||
% - finite
|
||||
% - batch
|
||||
% - linkage
|
||||
% - user
|
||||
% - differential privacy
|
||||
% - perturbation (Laplace)
|
||||
% - global
|
||||
\hypertarget{gursoy2018differentially}{Gursoy et al.}~\cite{gursoy2018differentially} designed \emph{DP-Star}, a differential privacy framework that publishes synthetic trajectories featuring similar statistics compared to the original ones.
|
||||
By utilizing the \emph{Minimum Description Length} (MDL) principle~\cite{grunwald2007minimum}, DP-Star eliminates redundant data points in the original trajectories, and generates trajectories containing only representative points.
|
||||
In this way, it is necessary to allocate the available privacy budget to far less data points, striking a balance between preciseness and conciseness.
|
||||
Moreover, the algorithm constructs a density-aware grid, with granularity that adapts to the geographical density of the trajectory points of the data set and preserves the spatial density despite any necessary perturbation.
|
||||
Then, DP-Star preserves the dependence between the trajectories' start and end points by extracting (through a first-order Markov mobility model) the trip distribution, and the intra-trajectory mobility.
|
||||
Finally, a Median Length Estimation (MLE) mechanism approximates the trajectories' lengths, and the framework generates privacy and utility preserving synthetic trajectories.
|
||||
Every phase of the process consumes some predefined privacy budget, keeping the respective products of each phase
|
||||
private and eligible for publishing.
|
||||
The authors compare their design with that of~\cite{chen2012differentially} and~\cite{he2015dpt} by running several tests, and ascertain that it outperforms them in terms of data utility.
|
||||
However, due to DP-Star's privacy budget distribution to its different phases, for small values of $\varepsilon$ the framework's privacy performance is inferior to that of its competitors.
|
||||
|
||||
|
||||
\subsection{Infinite observation}
|
||||
\label{subsec:micro-infinite}
|
||||
|
||||
% Continuous privacy preserving publishing of data streams
|
||||
% - microdata
|
||||
% - infinite
|
||||
% - stream
|
||||
% - as k-anonymity
|
||||
% - event
|
||||
% - k-anonymity
|
||||
% - generalization
|
||||
\hypertarget{zhou2009continuous}{Zhou et al.}~\cite{zhou2009continuous} introduce the problem of infinite private data publishing, and propose a randomized solution based on $k$-anonymity.
|
||||
More precisely, they continuously publish equivalence classes of size greater than or equal to $k$ containing generalized tuples from distinct persons (or identifiers in general).
|
||||
To create the equivalence classes they set several desiderata.
|
||||
Except for the size of a class, which should be greater than or equal to $k$, the information loss incurred by the generalization should be minimal, whereas the delay in forming and publishing the class should be kept low as well.
|
||||
To achieve these requirements, they build a randomized model using the popular structure of $R$-trees, extended to accommodate data density distribution information.
|
||||
In this way, they achieve a better quality/publishing delay ratio for the released private data.
|
||||
On the one hand, the formed classes contain data items that are close to each other (in dense areas), while on the other hand, classes with tuples of sparse areas are released as soon as possible so that the delay will remain low.
|
||||
|
||||
% Maskit: Privately releasing user context streams for personalized mobile applications
|
||||
% - microdata (context)
|
||||
% - infinite
|
||||
% - streaming
|
||||
% - dependence
|
||||
% - event
|
||||
% - $\delta$-privacy
|
||||
% - suppression
|
||||
% - temporal (Markov)
|
||||
% - local
|
||||
\hypertarget{gotz2012maskit}{Gotz et al.}~\cite{gotz2012maskit} developed \emph{MaskIt}, a system that interfaces the sensors of a personal device, identifies various sets of contexts, and releases a stream of privacy-preserving contexts to untrusted applications installed on the device.
|
||||
A context represents the circumstances that form the setting for an event, e.g.,~`at the office', `running', etc.
|
||||
The individuals have to define the sensitive contexts that they wish to be protected, and the desired level of privacy.
|
||||
The system models the individuals' various contexts, and transitions between them.
|
||||
It captures temporal correlations, and models individuals' movement in the space using Markov chains while taking into account historical observations.
|
||||
After the initialization, MaskIt filters a stream of the individual's contexts by checking, for each context, whether it is safe to release it or whether it is necessary to suppress it.
|
||||
The authors define \emph{$\delta$-privacy} as the privacy model of MaskIt.
|
||||
More specifically, a system preserves $\delta$-privacy if the difference between the posterior and prior knowledge of an adversary after observing an output at any possible timestamp is bounded by $\delta$.
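Expressed as an inequality (our notation, following the description above), for every sensitive context $c$, every timestamp $t$, and every output sequence $\pmb{o}$ that the system may release, $\delta$-privacy requires
\[
\Pr[\, c \text{ at } t \mid \pmb{o} \,] - \Pr[\, c \text{ at } t \,] \leq \delta .
\]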
|
||||
After filtering all the elements of an input stream, MaskIt releases an output sequence for a single day.
|
||||
The system can repeat the process to publish longer context streams.
|
||||
The expected number of released contexts quantifies the utility of the system.
|
||||
|
||||
% PLP: Protecting location privacy against correlation analyze Attack in crowdsensing
|
||||
% - microdata (context, location)
|
||||
% - infinite
|
||||
% - streaming
|
||||
% - dependence
|
||||
% - event
|
||||
% - $\delta$-privacy
|
||||
% - suppression
|
||||
% - spatiotemporal (CRF)
|
||||
% - local
|
||||
\hypertarget{ma2017plp}{Ma et al.}~\cite{ma2017plp} propose \emph{PLP} (Protecting Location Privacy), a crowdsensing scheme that protects location privacy against adversaries that can extract spatiotemporal correlations from crowdsensing data.
|
||||
PLP filters an individual's context (location, sensing data) stream while it takes into consideration long-range dependencies among locations and reported sensing data, which are modeled by CRFs.
|
||||
It suppresses sensing data at all sensitive locations while data at non-sensitive locations are reported with a certain probability defined by observing the corresponding CRF model.
|
||||
On the one hand, the scheme estimates the privacy of the reported data by the difference $\delta$ between the probability that an individual would be at a specific location given the supplementary information versus the same probability without the extra information.
|
||||
On the other hand, it quantifies the utility by measuring the total amount of reported data (more is better).
|
||||
An estimation algorithm searches for the optimal strategy that maximizes utility while preserving a predefined privacy threshold.
|
||||
|
||||
% An adaptive geo-indistinguishability mechanism for continuous LBS queries
|
||||
% - microdata
|
||||
% - infinite/finite (not clear)
|
||||
% - streaming
|
||||
% - dependence
|
||||
% - event
|
||||
% - geo-indistinguishability
|
||||
% - perturbation (planar Laplace)
|
||||
% - local
|
||||
\hypertarget{al2018adaptive}{Al-Dhubhani and Cazalas}~\cite{al2018adaptive} propose an adaptive privacy-preserving technique based on geo-indistinguishability, which adjusts the amount of noise required to obfuscate an individual's location based on its correlation level with the previously published locations.
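As a brief reminder (standard definition, not specific to this work), a mechanism $K$ satisfies $\varepsilon$-geo-indistinguishability if, for any two locations $x$, $x'$ and any set $Z$ of possible reported locations,
\[
\Pr[K(x) \in Z] \leq e^{\varepsilon d(x, x')} \Pr[K(x') \in Z] ,
\]
where $d(\cdot, \cdot)$ is the Euclidean distance; intuitively, locations that are geographically close produce nearly indistinguishable reported outputs.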
|
||||
Before adding noise, an evaluation of the adversary's ability to estimate an individual's position takes place.
|
||||
This process utilizes a regression algorithm for a certain prediction window that exploits previous location releases.
|
||||
More concretely, in areas with locations presenting strong correlations, an adversary can predict the current location with low estimation error.
|
||||
Consequently, it is necessary to add more noise to the locations prior to their release.
|
||||
Adapting the amount of injected noise depending on the data correlation level might lead to a better performance, in terms of both privacy and utility, in the short term.
|
||||
However, altering the amount of injected noise at each timestamp, without ensuring the preservation of the features (including correlations) present in the original data, might lead to arbitrary utility loss.
|
||||
|
||||
% Preventing velocity-based linkage attacks in location-aware applications
|
||||
% - microdata (trajectory)
|
||||
% - infinite
|
||||
% - streaming
|
||||
% - dependence (velocity)
|
||||
% - event
|
||||
% - temporal and spatial cloaking
|
||||
% - local and global
|
||||
\hypertarget{ghinita2009preventing}{Ghinita et al.}~\cite{ghinita2009preventing} tackle attacks to location privacy that arise from the linkage of maximum velocity with cloaked regions when using an LBS.
|
||||
The authors propose methods that can prevent the disclosure of the exact location coordinates of an individual, and bound the association probability of an individual to a sensitive location-related feature.
|
||||
The first method is based on temporal cloaking and utilizes deferral, and postdating.
|
||||
Deferral delays the disclosure of a cloaked region that is impossible for an individual to have reached based on the latest region that she published and her known maximum speed.
|
||||
Postdating reports the nearest previous cloaked region that will allow the LBS to return relevant results with high probability, since the two regions are close.
|
||||
The second method implements spatial cloaking.
|
||||
First, it creates cloaked regions by taking into account all of the user-specified sensitive features that are relevant to the current location (filtering of features).
|
||||
Then, it enlarges the area of the region to satisfy the privacy requirements (cloaking).
|
||||
Finally, it defers the publishing of the region until it includes the current timestamp (safety enforcement) similar to temporal cloaking.
|
||||
The system measures the quality of service of both methods in terms of the cloaked region size, time and space error, and failure ratio.
|
||||
The cloaked region size is important because larger regions may decrease the utility of the information that the LBS might return.
|
||||
The time and space error is possible due to delayed location reporting and region cloaking.
|
||||
Failure ratio corresponds to the percentage of dropped queries in cases where it is impossible to satisfy the privacy requirements.
|
||||
Although both methods experimentally prove to offer adequate quality of service, the privacy requirements and metrics that the authors consider do not offer substantial privacy guarantees for commercial application.
|
||||
|
||||
% A Trajectory Privacy-Preserving Algorithm Based on Road Networks in Continuous Location-based Services
|
||||
% - microdata (trajectory)
|
||||
% - infinite
|
||||
% - streaming
|
||||
% - linkage
|
||||
% - event
|
||||
% - $l$-diversity
|
||||
% - generalization (cloaking)
|
||||
% - LBS but global
|
||||
\hypertarget{ye2017trajectory}{Ye et al.}~\cite{ye2017trajectory} present an $l$-diversity method for producing a cloaked area, based on the local road network, for protecting trajectories.
|
||||
A trusted entity divides the spatial region of interest based on the density of the road network, using quadtree structures, until every subregion contains at least $l$ road segments.
|
||||
Then, it creates a database for each subregion by generating all the possible trajectories based on real road network information.
|
||||
The trusted entity uses this database, when individuals attempt to interact with an LBS by sending their current location, to predict their next locations.
|
||||
Thereafter, it selects the $l - 1$ nearest trajectories to the individual's current location, and constructs a minimum cloaking region.
|
||||
The resulting cloaking area covers the $l$ nearest trajectories and ensures a minimum area of coverage.
|
||||
This method addresses the limitations of $k$-anonymity in terms of continuous data publishing of trajectories.
|
||||
The required calculation of every possible trajectory, for the construction of a trajectory database for every subregion, might require an arbitrary amount of computations depending on the area's features.
|
||||
Nonetheless, the utilization of quadtrees can limit the overhead of the searching process.
|
||||
|
||||
% Quantifying Differential Privacy under Temporal Correlations
|
||||
% - statistical
|
||||
% - infinite/finite
|
||||
% - streaming
|
||||
% - dependence
|
||||
% - mainly (w-)event but also user
|
||||
% - differential privacy
|
||||
% - perturbation (Laplace)
|
||||
% - temporal correlations (Markov)
|
||||
\hypertarget{cao2017quantifying}{Cao et al.}~\cite{cao2017quantifying,cao2018quantifying} propose a method for computing the temporal privacy loss of a differential privacy mechanism in the presence of temporal correlations and background knowledge.
|
||||
The goal of their technique is to guarantee privacy protection and to bound the privacy loss at every time point under the assumption of independent data releases.
|
||||
It calculates the temporal privacy loss as the sum of the backward and forward privacy loss minus the default privacy loss $\varepsilon$ of the mechanism (because it is counted twice in the aforementioned entities).
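In symbols (our notation), if $\alpha^{B}_{t}$ and $\alpha^{F}_{t}$ denote the backward and forward privacy loss at time point $t$, the temporal privacy loss at $t$ is
\[
\alpha_{t} = \alpha^{B}_{t} + \alpha^{F}_{t} - \varepsilon .
\]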
|
||||
This calculation is done for each individual that is included in the original data set, and the overall temporal privacy loss is equal to the maximum calculated value at every time point.
|
||||
The backward/forward privacy loss at any time point depends on the backward/forward privacy loss at the previous/next instance, the backward/forward temporal correlations, and $\varepsilon$.
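Using notation introduced here purely for presentation, the above can be summarized as
$$\alpha_t = \alpha^{B}_{t} + \alpha^{F}_{t} - \varepsilon,$$
where $\alpha^{B}_{t}$ and $\alpha^{F}_{t}$ denote the backward and forward privacy loss at timestamp $t$.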
|
||||
The authors propose solutions to bound the temporal privacy loss, under the presence of weak to moderate correlations, in both finite and infinite data publishing scenarios.
|
||||
In the latter case, they try to find a value for $\varepsilon$ for which the backward and forward privacy loss are equal.
|
||||
In the former, they similarly try to balance the backward and forward privacy loss while they allocate more $\varepsilon$ at the first and last time points, since they have higher impact to the privacy loss of the next and previous ones.
|
||||
This way they achieve an overall constant temporal privacy loss throughout the time series.
|
||||
According to the technique's intuition, stronger correlations result in higher privacy loss.
|
||||
However, the loss is smaller when the transition matrix, which is extracted according to the modeling of the correlations (here, a Markov chain), has a larger dimension; this is due to the fact that larger transition matrices tend to be more uniform, resulting in weaker data dependence.
|
||||
The authors briefly investigate all of the possible privacy levels; however, the solutions that they propose are suitable only for event-level protection.
|
||||
Last but not least, the technique requires the calculation of the temporal privacy loss for every individual within the data set, which might prove computationally inefficient in real-time scenarios.
|
584
text/preliminaries.tex
Normal file
584
text/preliminaries.tex
Normal file
@ -0,0 +1,584 @@
|
||||
\chapter{Preliminaries}
|
||||
\label{ch:prel}
|
||||
|
||||
In this chapter, we introduce some relevant terminology and background knowledge around the problem of continuous publishing of sensitive data sets.
|
||||
First, we categorize data as we view them in the context of continuous data publishing.
|
||||
Second, we define data privacy, we list the kinds of attacks that have been identified in the literature, as well as the desired privacy levels that can be achieved, and the basic privacy operations that are applied to achieve data privacy.
|
||||
Third, we provide a brief overview of the seminal works on privacy-preserving data publishing, used also in continuous data publishing, fundamental in the domain and important for the understanding of the rest of the survey.
|
||||
|
||||
To accompany and facilitate the descriptions in this chapter, we provide the following running example.
|
||||
|
||||
\begin{example}
|
||||
\label{ex:snapshot}
|
||||
Users interact with an LBS by making queries in order to retrieve some useful location-based information, or by simply reporting their state at various locations.
|
||||
This user--LBS interaction generates user-related data, organized in a schema with the following attributes: \emph{Name} (the unique identifier of the table), \emph{Age}, \emph{Location}, and \emph{Status} (Table~\ref{tab:snapshot-micro}).
|
||||
The `Status' attribute includes information that characterizes the user's state or the query itself, and its value varies according to the service functionality.
|
||||
Subsequently, the generated data are aggregated (by issuing count queries over them) in order to derive useful information about the popularity of the venues during the day (Table~\ref{tab:snapshot-statistical}).
|
||||
|
||||
\begin{table}
|
||||
\centering\hspace{\fill}
|
||||
\subcaptionbox{Microdata\label{tab:snapshot-micro}}{%
|
||||
\begin{tabular}{@{}lrll@{}}
|
||||
\toprule
|
||||
\textit{Name} & \multicolumn{1}{c}{Age} & Location & Status \\
|
||||
\midrule
|
||||
Donald & $27$ & Le Marais & at work \\
|
||||
Daisy & $25$ & Belleville & driving \\
|
||||
Huey & $12$ & Montmartre & running \\
|
||||
Dewey & $11$ & Montmartre & at home \\
|
||||
Louie & $10$ & Latin Quarter & walking \\
|
||||
Quackmore & $62$ & Opera & dining \\
|
||||
\bottomrule
|
||||
\end{tabular}%
|
||||
}\hspace{\fill}
|
||||
\subcaptionbox{Statistical data\label{tab:snapshot-statistical}}{%
|
||||
\begin{tabular}{@{}lr@{}}
|
||||
\toprule
|
||||
Location & \multicolumn{1}{c@{}}{Count} \\
|
||||
\midrule
|
||||
Belleville & $1$ \\
|
||||
Latin Quarter & $1$ \\
|
||||
Le Marais & $1$ \\
|
||||
Montmartre & $2$ \\
|
||||
Opera & $1$ \\
|
||||
\bottomrule
|
||||
\\
|
||||
\end{tabular}%
|
||||
}\hspace{\fill}
|
||||
\caption{Example of raw user-generated (a)~microdata, and related (b)~statistical data for a specific timestamp.}
|
||||
\label{tab:snapshot}
|
||||
\end{table}
|
||||
\end{example}
|
||||
|
||||
|
||||
\section{Data}
|
||||
\label{sec:data}
|
||||
|
||||
|
||||
\subsection{Categories}
|
||||
\label{subsec:data-categories}
|
||||
|
||||
As this survey is about privacy, the data that we are interested in contain information about individuals and their actions.
|
||||
We first classify the data based on their content:
|
||||
|
||||
\begin{itemize}
|
||||
\item \emph{Microdata}---the data items in their raw, usually tabular, form pertaining to individuals or objects.
|
||||
\item \emph{Statistical data}---the outcome of statistical processes on microdata.
|
||||
\end{itemize}
|
||||
|
||||
An example of microdata is displayed in Table~\ref{tab:snapshot-micro}, while an example of statistical data in Table~\ref{tab:snapshot-statistical}.
|
||||
Data, in either of these two forms, may have a special property called~\emph{continuity}, i.e.,~their values change and can be observed through time.
|
||||
Depending on the span of observation, we distinguish the following categories:
|
||||
|
||||
\begin{itemize}
|
||||
\item \emph{Finite data}---data are observed during a predefined time interval.
|
||||
\item \emph{Infinite data}---data are observed in an uninterrupted fashion.
|
||||
\end{itemize}
|
||||
|
||||
\begin{example}
|
||||
\label{ex:continuous}
|
||||
Extending Example~\ref{ex:snapshot}, Table~\ref{tab:continuous} shows an example of continuous data observation, by introducing one data table for each consecutive timestamp.
|
||||
The two data tables, over the time-span $[t_1, t_2]$ are an example of finite data.
|
||||
Infinite data are the whole series of data obtained over the period~$[t_1, \infty)$ (infinity is denoted by `\dots').
|
||||
|
||||
\begin{table}
|
||||
\centering
|
||||
\subcaptionbox{Microdata\label{tab:continuous-micro}}{%
|
||||
\adjustbox{max width=\linewidth}{%
|
||||
\begin{tabular}{@{}ccc@{}}
|
||||
\begin{tabular}{@{}lrll@{}}
|
||||
\toprule
|
||||
\textit{Name} & \multicolumn{1}{c}{Age} & Location & Status \\
|
||||
\midrule
|
||||
Donald & $27$ & Le Marais & at work \\
|
||||
Daisy & $25$ & Belleville & driving \\
|
||||
Huey & $12$ & Montmartre & running \\
|
||||
Dewey & $11$ & Montmartre & at home \\
|
||||
Louie & $10$ & Latin Quarter & walking \\
|
||||
Quackmore & $62$ & Opera & dining \\
|
||||
\bottomrule
|
||||
\multicolumn{4}{c}{$t_1$} \\
|
||||
\end{tabular} &
|
||||
\begin{tabular}{@{}lrll@{}}
|
||||
\toprule
|
||||
\textit{Name} & \multicolumn{1}{c}{Age} & Location & Status \\
|
||||
\midrule
|
||||
Donald & $27$ & Montmartre & driving \\
|
||||
Daisy & $25$ & Montmartre & at the mall \\
|
||||
Huey & $12$ & Latin Quarter & sightseeing \\
|
||||
Dewey & $11$ & Opera & walking \\
|
||||
Louie & $10$ & Latin Quarter & at home \\
|
||||
Quackmore & $62$ & Montmartre & biking \\
|
||||
\bottomrule
|
||||
\multicolumn{4}{c}{$t_2$} \\
|
||||
\end{tabular} &
|
||||
\dots
|
||||
\end{tabular}%
|
||||
}%
|
||||
} \\ \bigskip
|
||||
\subcaptionbox{Statistical data\label{tab:continuous-statistical}}{%
|
||||
\begin{tabular}{@{}lrrr@{}}
|
||||
\toprule
|
||||
\multirow{2}{*}{Location} & \multicolumn{3}{c@{}}{Count}\\
|
||||
& \multicolumn{1}{c}{$t_1$} & \multicolumn{1}{c}{$t_2$} & \dots \\
|
||||
\midrule
|
||||
Belleville & $1$ & $0$ & \dots \\
|
||||
Latin Quarter & $1$ & $2$ & \dots \\
|
||||
Le Marais & $1$ & $0$ & \dots \\
|
||||
Montmartre & $2$ & $3$ & \dots \\
|
||||
Opera & $1$ & $1$ & \dots \\
|
||||
\bottomrule
|
||||
\end{tabular}%
|
||||
}%
|
||||
\caption{Continuous data observation of (a)~microdata, and corresponding (b)~statistics at multiple timestamps.}
|
||||
\label{tab:continuous}
|
||||
\end{table}
|
||||
\end{example}
|
||||
|
||||
We further define two sub-categories applicable to both finite and infinite data: \emph{sequential} and \emph{incremental} data; these two subcategories are not exhaustive, i.e.,~not every data set belongs to one or the other category.
|
||||
In sequential data, the value of the observed variable changes, depending on its previous value.
|
||||
For example, trajectories are finite sequences of location stamps, as naturally the position at each timestamp is connected to the position at the previous timestamp.
|
||||
In incremental data, an original data set is augmented in each subsequent timestamp with supplementary information.
|
||||
For example, trajectories can be considered as incremental data, when at each timestamp we consider all the locations previously visited by an individual, augmented with their current position.
|
||||
|
||||
|
||||
\subsection{Processing and publishing}
|
||||
\label{subsec:data-publishing}
|
||||
|
||||
We categorize data processing and publishing based on the implemented scheme, as:
|
||||
|
||||
\begin{itemize}
|
||||
\item \emph{Global}---data are collected, processed and privacy-protected, and then published by a central (trusted) entity, e.g.,~\cite{mcsherry2009privacy, blocki2013differentially, johnson2018towards}.
|
||||
\item \emph{Local}---data are stored, processed and privacy-protected on the side of data generators before sending them to any intermediate or final entity, e.g.,~\cite{andres2013geo, erlingsson2014rappor, katsomallos2017open}.
|
||||
\end{itemize}
|
||||
|
||||
\begin{figure}[htp]
|
||||
\centering
|
||||
\subcaptionbox{Global scheme\label{fig:scheme-global}}{%
|
||||
\includegraphics[width=\linewidth]{scheme-global}%
|
||||
} \\ \bigskip
|
||||
\subcaptionbox{Local scheme\label{fig:scheme-local}}{%
|
||||
\includegraphics[width=\linewidth]{scheme-local}%
|
||||
}
|
||||
\caption{The usual flow of user-generated data, optionally harvested by data publishers, privacy-protected, and released to data consumers, according to the (a)~global, and (b)~local privacy schemes.}
|
||||
\label{fig:privacy-schemes}
|
||||
\end{figure}
|
||||
|
||||
In the case of location data privacy, the existing literature is divided into
|
||||
\emph{service-} and \emph{data-}centric methods~\cite{chow2011trajectory}.
|
||||
The service-centric methods correspond to scenarios where individuals share their privacy-protected location with a service to get some relevant information (local publishing scheme).
|
||||
The data-centric methods relate to the publishing of user-generated data to data consumers (global publishing scheme).
|
||||
|
||||
There is a long-standing debate whether the local or the global architectural scheme is more efficient with respect to not only privacy, but also organizational, economic, and security factors~\cite{king1983centralized}.
|
||||
On the one hand, in the global privacy scheme (Figure~\ref{fig:scheme-global}), the dependence on third-party entities poses the risk of arbitrary privacy leakage from a compromised data publisher.
|
||||
Nonetheless, the expertise of these entities is usually superior to that of the majority of (non-technical) data generators in terms of understanding privacy permissions/\allowbreak policies and setting up relevant preferences.
|
||||
Moreover, in the global architecture, less distortion is necessary before publicly releasing the aggregated data set, naturally because the data sets are larger and users can be `hidden' more easily.
|
||||
On the other hand, the local privacy scheme (Figure~\ref{fig:scheme-local}) facilitates fine-grained data management, offering to every individual better control over their data~\cite{goldreich1998secure}.
|
||||
Nonetheless, data distortion at an early stage might prove detrimental to the overall utility of the aggregated data set.
|
||||
The consensus so far is that there is no overall optimal solution among the two designs.
|
||||
Most service-providing companies prefer the global scheme, mainly for reasons of better management and control over the data, while several privacy advocates support the local privacy scheme that offers users full control over what and how data are published.
|
||||
Although there have been attempts to bridge the gap between them, e.g.,~\cite{bittau2017prochlo}, the global scheme is considerably better explored and implemented~\cite{satyanarayanan2017emergence}.
|
||||
For this reason, most of the works in this survey span this context.
|
||||
|
||||
We distinguish between two publishing modes for private data: \emph{snapshot} and \emph{continuous}.
|
||||
In snapshot publishing (also appearing as \emph{one-shot} or \emph{one-off} publishing), the system processes and releases a data set at a specific point in time, and is thereafter no longer concerned with that data set.
|
||||
For example, in Figure~\ref{fig:mode-snapshot} (ignore the privacy-preserving step for the moment) individuals send their data to an LBS provider, considering a specific time point.
|
||||
In continuous data publishing, the system computes and publishes augmented or updated versions of a data set at different points in time, without a predefined duration.
|
||||
In the context of privacy-preserving data publishing, privacy preservation is tightly coupled with the data processing and publishing stages.
|
||||
|
||||
As already discussed in Chapter~\ref{ch:intro}, in this survey we study the continuous data publishing mode, and thus we do not include works considering the snapshot paradigm.
|
||||
We make this deliberate choice as privacy-preserving continuous data publishing is a more complex problem, receiving more and more attention from the scientific community in recent years, as shown by the increasing number of publications in this area.
|
||||
Moreover, the use cases of continuous data publishing abound, with the proliferation of the Internet, sensors, and connected devices, which produce and send to servers huge amounts of continuous personal data at astounding speed.
|
||||
|
||||
We identify two main data processing and publishing modes:
|
||||
|
||||
\begin{itemize}
|
||||
\item \emph{Batch}---data are considered in groups in specific time intervals.
|
||||
\item \emph{Streaming}---data are considered per timestamp, infinitely.
|
||||
\end{itemize}
|
||||
|
||||
\begin{figure}[htp]
|
||||
\centering
|
||||
\subcaptionbox{Snapshot mode\label{fig:mode-snapshot}}{%
|
||||
\includegraphics[width=.4\linewidth]{mode-snapshot}%
|
||||
} \\ \bigskip\hspace{\fill}
|
||||
\subcaptionbox{Batch mode\label{fig:mode-batch}}{%
|
||||
\includegraphics[width=.4\linewidth]{mode-batch}%
|
||||
}\hspace{\fill}
|
||||
\subcaptionbox{Streaming mode\label{fig:mode-streaming}}{%
|
||||
\includegraphics[width=.4\linewidth]{mode-streaming}%
|
||||
}\hspace{\fill}
|
||||
\caption{The different data processing and publishing modes of continuously generated data sets.
|
||||
(a)~Snapshot publishing, (b)~continuous publishing--batch mode, and (c)~continuous publishing--streaming mode.
|
||||
$\pmb{o}_x$ denotes the privacy-protected version of the data set $D_x$ or statistics thereof, while `\dots' denote the continuous data generation and/or publishing, where applicable.
|
||||
Depending on the data observation span, $n$ can either be finite or tend to infinity.}
|
||||
\label{fig:privacy-modes}
|
||||
\end{figure}
|
||||
|
||||
Batch data processing and publishing (Figure~\ref{fig:mode-batch}) is performed (usually offline) over both finite and infinite data, while streaming processing and publishing (Figure~\ref{fig:mode-streaming}) is by definition connected to infinite data (usually in real-time).
|
||||
|
||||
|
||||
\section{Privacy}
|
||||
\label{sec:privacy}
|
||||
|
||||
When personal data are publicly released, either as microdata or statistical data, individuals' privacy can be compromised, i.e.,~an adversary becomes certain about an individual's personal information with a probability higher than a desired threshold.
|
||||
In the literature, this compromise is known as \emph{information disclosure} and is usually categorized as~\cite{li2007t, wang2010privacy, narayanan2008robust}:
|
||||
|
||||
\begin{itemize}
|
||||
\item \emph{Presence disclosure}---the participation (or absence) of an individual in a data set is revealed.
|
||||
\item \emph{Identity disclosure}---an individual is linked to a particular record.
|
||||
\item \emph{Attribute disclosure}---new information (attribute value) about an individual is revealed.
|
||||
\end{itemize}
|
||||
|
||||
In the literature, identity disclosure is also referred to as \emph{record linkage}, and presence disclosure as \emph{table linkage}.
|
||||
Notice that identity disclosure can result in attribute disclosure, and vice versa.
|
||||
|
||||
To better illustrate these definitions, we provide some examples based on Table~\ref{tab:snapshot}.
|
||||
Presence disclosure appears when, by looking at the (privacy-protected) counts of Table~\ref{tab:snapshot-statistical}, we can guess whether Quackmore participated in Table~\ref{tab:snapshot-micro}.
|
||||
Identity disclosure appears when we can guess that the sixth record of (a privacy-protected version of) the microdata of Table~\ref{tab:snapshot-micro} belongs to Quackmore.
|
||||
Attribute disclosure appears when it is revealed from (a privacy-protected version of) the microdata of Table~\ref{tab:snapshot-micro} that Quackmore is $62$ years old.
|
||||
|
||||
|
||||
\subsection{Levels}
|
||||
\label{subsec:privacy-levels}
|
||||
|
||||
The information disclosure that a data release may entail is often linked to the protection level that a privacy-preserving algorithm is trying to achieve.
|
||||
More specifically, in continuous data publishing the privacy protection level is considered with respect to not only the users but also to the \emph{events} occurring in the data.
|
||||
An event is a pair consisting of an identifying attribute of an individual and the sensitive data (including contextual information), and can be seen as corresponding to a record in a database in which each individual may participate once.
|
||||
Data publishers typically release events in the form of sequences of data points, usually indexed in time order (time series) and geotagged, e.g.,~(`Dewey', `at home at Montmartre at $t_1$'), \dots, (`Quackmore', `dining at Opera at $t_1$').
|
||||
The term `users' is used to refer to the \emph{individuals}, also known as \emph{participants}, who are the source of the processed and published data.
|
||||
Therefore, they should not be confused with the consumers of the released data sets.
|
||||
Users are subject to privacy attacks, and thus are the main point of interest of privacy protection mechanisms.
|
||||
In more detail, the privacy protection levels are:
|
||||
|
||||
\begin{itemize}
|
||||
\item \emph{Event}~\cite{dwork2010differential, dwork2010pan}---\emph{any single event} of any individual is protected.
|
||||
\item \emph{User}~\cite{dwork2010differential, dwork2010pan}---\emph{all the events} of any individual, spanning the observed event sequence, are protected.
|
||||
\item \emph{$w$-event}~\cite{kellaris2014differentially}---\emph{any sequence of $w$ events}, within the released series of events, of any individual is protected.
|
||||
\end{itemize}
|
||||
|
||||
Figure~\ref{fig:privacy-levels} demonstrates the application of the possible protection levels on the statistical data of Example~\ref{ex:continuous}.
|
||||
For instance, in event-level (Figure~\ref{fig:level-event}) it is hard to determine whether Quackmore was dining at Opera at $t_1$.
|
||||
Moreover, in user-level (Figure~\ref{fig:level-user}) it is hard to determine whether Quackmore was ever included in the released series of events at all.
|
||||
Finally, in $2$-event-level (Figure~\ref{fig:level-w-event}) it is hard to determine whether Quackmore was ever included in the released series of events between the timestamps $t_1$ and $t_2$, $t_2$ and $t_3$, etc. (i.e.,~for a window $w = 2$).
|
||||
|
||||
\begin{figure}[htp]
|
||||
\centering
|
||||
\hspace{\fill}\subcaptionbox{Event-level\label{fig:level-event}}{%
|
||||
\includegraphics[width=.32\linewidth]{level-event}%
|
||||
}\hspace{\fill}
|
||||
\subcaptionbox{User-level\label{fig:level-user}}{%
|
||||
\includegraphics[width=.32\linewidth]{level-user}%
|
||||
}\hspace{\fill}
|
||||
\subcaptionbox{$2$-event-level\label{fig:level-w-event}}{%
|
||||
\includegraphics[width=.32\linewidth]{level-w-event}%
|
||||
}\hspace{\fill}
|
||||
\caption{Protecting the data of Table~\ref{tab:continuous-statistical} on (a)~event-, (b)~user-, and (c)~$2$-event-level. A suitable distortion method can be applied accordingly.}
|
||||
\label{fig:privacy-levels}
|
||||
\end{figure}
|
||||
|
||||
Contrary to event-level, which provides privacy guarantees for a single event, user- and $w$-event-level offer stronger privacy protection by protecting a series of events.
|
||||
In use-cases that involve infinite data, event- and $w$-event-level attain an adequate balance between data utility and user privacy, whereas user-level is more appropriate when the span of data observation is predefined.
|
||||
$w$-event- is narrower than user-level protection due to its sliding window processing methodology.
|
||||
In the extreme cases where $w$ is set either to $1$ or to the entire length of the series of events, $w$-event-level matches event- or user-level protection, respectively.
|
||||
Although the described levels have been coined in the context of \emph{differential privacy}~\cite{dwork2006calibrating}, a seminal privacy method that we will discuss in more detail in Section~\ref{subsec:privacy-statistical}, it is possible to apply their definitions to other privacy protection techniques as well.
|
||||
|
||||
|
||||
\subsection{Attacks}
|
||||
\label{subsec:privacy-attacks}
|
||||
|
||||
Information disclosure is typically achieved by combining supplementary (background) knowledge with the released data, or by exploiting unrealistic assumptions made while designing the privacy-preserving algorithms.
|
||||
In its general form, this is known as \emph{adversarial} or \emph{linkage} attack.
|
||||
Even though many works directly refer to the general category of linkage attacks, we distinguish also the following sub-categories, addressed in the literature:
|
||||
|
||||
\begin{itemize}
|
||||
\item \emph{Sensitive attribute domain} knowledge.
|
||||
Here we can identify \emph{homogeneity and skewness} attacks~\cite{machanavajjhala2006diversity,li2007t}, when statistics of the sensitive attribute values are available, and \emph{similarity attack}, when semantics of the sensitive attribute values are available.
|
||||
\item \emph{Complementary release} attacks~\cite{sweeney2002k} with regard to previous releases of different versions of the same and/or related data sets.
|
||||
In this category, we also identify the \emph{unsorted matching} attack~\cite{sweeney2002k}, which is achieved when two privacy-protected versions of an original data set are published in the same tuple ordering.
|
||||
Other instances include: (i)~the \emph{join} attack~\cite{wang2006anonymizing}, when tuples can be identified by joining (on the (quasi-)identifiers) several releases, (ii)~the \emph{tuple correspondence} attack~\cite{fung2008anonymity}, when in case of incremental data certain tuples correspond to certain tuples in other releases, in an injective way, (iii)~the \emph{tuple equivalence} attack~\cite{he2011preventing}, when tuples among different releases are found to be equivalent with respect to the sensitive attribute, and (iv)~the \emph{unknown releases} attack~\cite{shmueli2015privacy}, when the privacy preservation is performed without knowing the previously privacy-protected data sets.
|
||||
\item \emph{Data dependence}
|
||||
\begin{itemize}
|
||||
\item within one data set.
|
||||
Data tuples and data values within a data set may be correlated, or linked in such a way that information about one person can be inferred even if the person is absent from the database.
|
||||
Consequently, in this category we put assumptions made on the data generation model based on randomness, like the random world model, the independent and identically distributed data (i.i.d.) model, or the independent-tuples model, which may be unrealistic for many real-world scenarios.
|
||||
This attack is also known as \emph{deFinetti's attack}~\cite{kifer2009attacks}.
|
||||
\item among one data set and previous data releases, and/or other external sources~\cite{kifer2011no, chen2014correlated, liu2016dependence, zhao2017dependent}.
|
||||
The strength of the dependence between a pair of variables can be quantified with the utilization of \emph{correlations}~\cite{stigler1989francis}.
|
||||
Correlation implies dependence, but not vice versa; however, the two terms are often used as synonyms.
|
||||
The correlation among nearby observations, i.e.,~the elements in a series of data points, is referred to as \emph{autocorrelation} or \emph{serial correlation}~\cite{park2018fundamentals}.
|
||||
Depending on the evaluation technique, e.g.,~\emph{Pearson's correlation coefficient}~\cite{stigler1989francis}, a correlation can be characterized as \emph{negative}, \emph{zero}, or \emph{positive}.
|
||||
A negative value shows that the behavior of one variable is the \emph{opposite} of that of the other, e.g.,~when the one increases the other decreases.
|
||||
Zero means that the variables are not linked and are \emph{independent} of each other.
|
||||
A positive correlation indicates that the variables behave in a \emph{similar} manner, e.g.,~when the one decreases the other decreases as well.
|
||||
|
||||
The most prominent types of correlations include:
|
||||
\begin{itemize}
|
||||
\item \emph{Temporal}~\cite{wei2006time}---appearing in observations (i.e.,~values) of the same object over time.
|
||||
\item \emph{Spatial}~\cite{legendre1993spatial, anselin1995local}---denoted by the degree of similarity of nearby data points in space, and indicating if and how phenomena relate to the (broader) area where they take place.
|
||||
\item \emph{Spatiotemporal}---a combination of the previous categories, appearing when processing time series or sequences of human activities with geolocation characteristics, e.g.,~\cite{ghinita2009preventing}.
|
||||
\end{itemize}
|
||||
Contrary to one-dimensional correlations, spatial correlation is multi-dimensional and multi-directional, and can be measured by indicators (e.g.,~\emph{Moran's I}~\cite{moran1950notes}) that reflect the \emph{spatial association} of the concerned data.
|
||||
Spatial autocorrelation has its foundations in the \emph{First Law of Geography} stating that ``everything is related to everything else, but near things are more related than distant things''~\cite{tobler1970computer}.
|
||||
A positive spatial autocorrelation indicates that similar data are \emph{clustered}, a negative that data are dispersed and are close to dissimilar ones, and when close to zero, that data are \emph{randomly arranged} in space.
|
||||
\end{itemize}
|
||||
|
||||
A common practice for extracting data dependencies from continuous data is to express the data as a \emph{stochastic} or \emph{random process}.
|
||||
A random process is a collection of \emph{random variables} or \emph{bivariate data}, indexed by some set, e.g.,~a series of timestamps, a Cartesian plane $\mathbb{R}^2$, an $n$-dimensional Euclidean space, etc.~\cite{skorokhod2005basic}.
|
||||
The values a random variable can take are outcomes of an unpredictable process, while bivariate data are pairs of data values with a possible association between them.
|
||||
Expressing data as stochastic processes allows their modeling depending on their properties, and thereafter the discovery of relevant data dependencies.
|
||||
Some common stochastic processes modeling techniques include:
|
||||
|
||||
\begin{itemize}
|
||||
\item \emph{Conditional probabilities}~\cite{allan2013probability}---probabilities of events in the presence of other events.
|
||||
\item \emph{Conditional Random Fields} (CRFs)~\cite{lafferty2001conditional}---undirected graphs encoding conditional probability distributions.
|
||||
\item \emph{Markov processes}~\cite{rogers2000diffusions}---stochastic processes for which the conditional probability of future states depends only on the present state and is independent of the previous states (\emph{Markov assumption}; formalized right after this list).
|
||||
\begin{itemize}
|
||||
\item \emph{Markov chains}~\cite{gagniuc2017markov}---sequences of possible events whose probability depends on the state attained in the previous event.
|
||||
\item \emph{Hidden Markov Models} (HMMs)~\cite{baum1966statistical}---statistical Markov models of Markov processes with unobserved states.
|
||||
\end{itemize}
|
||||
\end{itemize}
|
||||
|
||||
\end{itemize}
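The Markov assumption mentioned in the list above can be formalized as
$$\Pr[X_{t + 1} = x \mid X_t, X_{t - 1}, \dots, X_1] = \Pr[X_{t + 1} = x \mid X_t],$$
where $X_1, X_2, \dots$ denotes the sequence of states of the process; for a Markov chain over a finite state space, these conditional probabilities are collected in a transition matrix, such as the ones used to model temporal correlations in~\cite{cao2017quantifying}.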
|
||||
|
||||
The first sub-category of attacks has been mainly addressed in works on snapshot microdata publishing, and is still present in continuous publishing; however, algorithms for continuous publishing typically accept the proposed solutions for the snapshot publishing scheme (see discussion over $k$-anonymity and $l$-diversity in Section~\ref{subsec:privacy-seminal}).
|
||||
This kind of attacks is tightly coupled with publishing the (privacy-protected) sensitive attribute value.
|
||||
An example is the lack of diversity in the sensitive attribute domain, e.g.,~if all users in the data set of Table~\ref{tab:snapshot-micro} shared the same \emph{running} Status (the sensitive attribute).
|
||||
The second and third subcategory are attacks emerging (mostly) in continuous publishing scenarios.
|
||||
Consider again the data set in Table~\ref{tab:snapshot-micro}.
|
||||
The complementary release attack means that an adversary can learn more things about the individuals (e.g.,~that there are high chances that Donald was at work) if he/she combines the information of two privacy-protected versions of this data set.
|
||||
By the data dependence attack, the status of Donald could be more certainly inferred, by taking into account the status of Dewey at the same moment and the dependencies between Donald's and Dewey's status, e.g.,~when Dewey is at home, then most probably Donald is at work.
|
||||
In order to better protect the privacy of Donald in case of attacks, the data should be privacy-protected in a more adequate way (than without the attacks).
|
||||
|
||||
|
||||
\subsection{Operations}
|
||||
\label{subsec:privacy-operations}
|
||||
|
||||
Protecting private information, which is known by many names (obfuscation, cloaking, anonymization, etc.), is achieved by using a specific basic privacy protection operation.
|
||||
Depending on the intervention that we choose to perform on the original data, we identify the following operations:
|
||||
|
||||
\begin{itemize}
|
||||
\item \emph{Aggregation}---group together multiple rows of a data set to form a single value.
|
||||
\item \emph{Generalization}---replace an attribute value with a parent value in the attribute taxonomy.
|
||||
Notice that a step of generalization may be followed by a step of \emph{specialization}, to improve the quality of the resulting data set.
|
||||
\item \emph{Suppression}---delete completely certain sensitive values or entire records.
|
||||
\item \emph{Perturbation}---disturb the initial attribute value in a deterministic or probabilistic way.
|
||||
The probabilistic data distortion is referred to as \emph{randomization}.
|
||||
\end{itemize}
|
||||
|
||||
For example, consider the table schema \emph{User(Name, Age, Location, Status)}.
|
||||
If we want to protect the \emph{Age} of a user: by aggregation, we may replace it with the average age in their Location; by generalization, we may replace the Age with age intervals; by suppression, we may delete the entire table column corresponding to \emph{Age}; by perturbation, we may augment each age by a predefined percentage of the age; by randomization, we may replace each age with a value drawn from the probability density function of the attribute.
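As a minimal illustration (a sketch of ours, not a reference implementation; the toy rows and helper names are assumptions), the operations above could be applied to the \emph{Age} attribute as follows:
\begin{verbatim}
import random

users = [
    {"Name": "Donald", "Age": 27, "Location": "Le Marais",  "Status": "at work"},
    {"Name": "Daisy",  "Age": 25, "Location": "Belleville", "Status": "driving"},
]

def aggregate_age(rows):      # replace Age by the average age at the same Location
    by_loc = {}
    for r in rows:
        by_loc.setdefault(r["Location"], []).append(r["Age"])
    return [{**r, "Age": sum(by_loc[r["Location"]]) / len(by_loc[r["Location"]])}
            for r in rows]

def generalize_age(rows):     # replace Age by an age interval
    return [{**r, "Age": "<=20" if r["Age"] <= 20 else ">20"} for r in rows]

def suppress_age(rows):       # delete the Age column entirely
    return [{k: v for k, v in r.items() if k != "Age"} for r in rows]

def perturb_age(rows, pct=0.1):   # deterministically augment Age by a fixed percentage
    return [{**r, "Age": round(r["Age"] * (1 + pct))} for r in rows]

def randomize_age(rows):      # draw a new Age from an assumed age distribution
    return [{**r, "Age": random.randint(10, 90)} for r in rows]
\end{verbatim}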
|
||||
|
||||
It is worth mentioning that there is a series of algorithms (e.g.,~\cite{benaloh2009patient, kamara2010cryptographic, cao2014privacy}) based on the \emph{cryptography} operation.
|
||||
However, the majority of these methods assume, among other things, minimal or even no trust in the entities that handle the personal information.
|
||||
Furthermore, the amount and the nature of the data processing that these techniques require usually burden the overall procedure, deteriorate the utility of the resulting data sets, and restrict their applicability.
|
||||
Our focus is limited to techniques that achieve a satisfying balance between both participants' privacy and data utility.
|
||||
For these reasons, there will be no further discussion around this family of techniques in this article.
|
||||
|
||||
|
||||
\subsection{Seminal works}
|
||||
\label{subsec:privacy-seminal}
|
||||
|
||||
For completeness, in this section we present the seminal works for privacy-preserving data publishing, which, even though originally designed for the snapshot publishing scenario, have paved the way, since many of the works in privacy-preserving continuous publishing are based on or extend them.
|
||||
|
||||
|
||||
\subsubsection{Microdata}
|
||||
\label{subsec:privacy-micro}
|
||||
|
||||
Sweeney coined \emph{$k$-anonymity}~\cite{sweeney2002k}, one of the first established works on data privacy.
|
||||
A released data set features $k$-anonymity protection when the sequence of values for a set of identifying attributes, called the \emph{quasi-identifiers}, is the same for at least $k$ records in the data set.
|
||||
Computing the quasi-identifiers in a set of attributes is still a hard problem on its own~\cite{motwani2007efficient}.
|
||||
$k$-anonymity is syntactic; it renders an individual indistinguishable from at least $k-1$ other individuals in the same data set.
|
||||
In a follow-up work~\cite{sweeney2002achieving}, the author describes a way to achieve $k$-anonymity for a data set by the suppression or generalization of certain values of the quasi-identifiers.
|
||||
Machanavajjhala et al.~\cite{machanavajjhala2006diversity} pointed out that $k$-anonymity is vulnerable to homogeneity and background knowledge attacks.
|
||||
Thereby, they proposed \emph{$l$-diversity}, which demands that the values of the sensitive attributes are `well-represented' by $l$ sensitive values in each group.
|
||||
Principally, a data set can be $l$-diverse by featuring at least $l$ distinct values for the sensitive field in each group (\emph{distinct} $l$-diversity).
|
||||
Other instantiations demand that the entropy of the sensitive attribute within each group is greater than or equal to $\log(l)$ (\emph{entropy} $l$-diversity), or that the number of appearances of the most common sensitive value in a group is less than a user-defined constant $c$ times the sum of the counts of the $l$-th through the least common sensitive values (\emph{recursive $(c, l)$-diversity}).
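In formulas (with notation adopted here only for presentation), entropy $l$-diversity requires for every group $g$, where $p_{(g, s)}$ is the fraction of records in $g$ with sensitive value $s \in S$, that
$$-\sum_{s \in S} p_{(g, s)} \log p_{(g, s)} \geq \log(l),$$
while recursive $(c, l)$-diversity requires that the counts $r_1 \geq r_2 \geq \dots \geq r_m$ of the sensitive values in a group satisfy $r_1 < c\,(r_l + r_{l + 1} + \dots + r_m)$.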
|
||||
Later on, Li et al.~\cite{li2007t} indicated that $l$-diversity can be void by skewness and similarity attacks due to sensitive attributes with a small value range.
|
||||
In such cases, \emph{$\theta$-closeness} guarantees that the distribution of a sensitive attribute in a group and the distribution of the same attribute in the whole data set are `similar'.
|
||||
This similarity is bounded by a threshold $\theta$.
|
||||
A data set features $\theta$-closeness when all of its groups feature $\theta$-closeness.
|
||||
|
||||
The main drawback of $k$-anonymity (and its derivatives) is that it is not resilient to external re-identification attacks on the released data set.
|
||||
The problems identified in~\cite{sweeney2002k} appear when attempting to apply $k$-anonymity on continuous data publishing (as we will also see next in Section~\ref{sec:micro}).
|
||||
These attacks include multiple $k$-anonymous data set releases with the same record order, subsequent releases of a data set without taking into account previous $k$-anonymous releases, and tuple updates.
|
||||
Proposed solutions include rearranging the attributes, setting the whole attribute set of previously released data sets as quasi-identifiers or releasing data based on previous $k$-anonymous releases.
|
||||
|
||||
|
||||
\subsubsection{Statistical data}
|
||||
\label{subsec:privacy-statistical}
|
||||
|
||||
While methods based on $k$-anonymity have been mainly employed for releasing microdata, \emph{differential privacy}~\cite{dwork2006calibrating} has been proposed for releasing high utility aggregates over microdata while providing semantic privacy guarantees.
|
||||
Differential privacy is algorithmic; it ensures that any adversary observing a privacy-protected output, no matter their computational power or auxiliary information, cannot conclude with absolute certainty whether an individual is included in the input data set.
|
||||
Moreover, it quantifies and bounds the impact that the addition/removal of the data of an individual to/from an input data set has on the derived privacy-protected aggregates.
|
||||
|
||||
In its formal definition, a \emph{privacy mechanism} $\mathcal{M}$, which outputs a query answer with some injected randomness, satisfies $\varepsilon$-differential privacy for a user-defined privacy budget $\varepsilon$~\cite{mcsherry2009privacy} if for all pairs of \emph{neighboring} (i.e.,~differing by the data of an individual) data sets $D$ and $D'$, it holds that:
|
||||
$$\Pr[\mathcal{M}(D) \in O]\leq e^\varepsilon \Pr[\mathcal{M}(D') \in O],$$
|
||||
|
||||
\noindent where $\Pr[\cdot]$ denotes the probability of an event, and $O$ is the world of possible outputs of a mechanism $\mathcal{M}$.
|
||||
As the definition implies, for low values of $\varepsilon$, $\mathcal{M}$ achieves stronger privacy protection since the probabilities of $D$ and $D'$ being true worlds are similar, but the utility of the mechanism's output is reduced since more randomness is introduced.
|
||||
The privacy budget $\varepsilon$ has a non-zero and positive value, and is usually set to $0.01$, $0.1$, or, in some cases, $\ln2$ or $\ln3$~\cite{lee2011much}.
|
||||
|
||||
A typical mechanism example is the \emph{Laplace mechanism}~\cite{dwork2014algorithmic}, which draws randomly a value from the probability distribution of $\textrm{Laplace}(\mu, b)$, where $\mu$ stands for the location parameter and $b > 0$ the scale parameter.
|
||||
Here, $\mu$ is equal to the original output value of a query function, and $b$ is the sensitivity of the query function divided by $\varepsilon$.
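In other words, for a query function $f$ with sensitivity $\Delta f$, the mechanism releases a value drawn as
$$\mathcal{M}(D) \sim \textrm{Laplace}\left(f(D), \frac{\Delta f}{\varepsilon}\right),$$
where $\textrm{Laplace}(\mu, b)$ has probability density $\frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right)$; equivalently, $\mathcal{M}(D) = f(D) + X$ with $X \sim \textrm{Laplace}(0, \frac{\Delta f}{\varepsilon})$.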
|
||||
The Laplace mechanism works for any function with range the set of real numbers.
|
||||
A specialization of this mechanism for location data is the \emph{Planar Laplace mechanism}~\cite{andres2013geo}, which is based on a multivariate Laplace distribution.
|
||||
For query functions that do not return a real number, e.g.,~`What is the most visited country this year?' or in cases where perturbing the value of the output will completely destroy its utility, e.g.,~`What is the optimal price for this auction?', most works in the literature use the \emph{Exponential mechanism}~\cite{dwork2014algorithmic}.
|
||||
This mechanism utilizes a utility function $u$ that maps (input data set $D$, output value $r$) pairs to utility scores, and selects an output value $r$ from the input pairs, with probability proportional to $\exp(\frac{\varepsilon u(D, r)}{2\Delta u})$,
|
||||
where $\Delta u$ is the sensitivity of the utility function.
|
||||
Another technique for differential privacy mechanisms is the \emph{randomized response}~\cite{warner1965randomized}.
|
||||
It is a privacy-preserving survey method that introduces probabilistic noise to the collected statistics by randomly instructing respondents to answer truthfully or `Yes' to a sensitive, binary question.
|
||||
The technique achieves this randomization by including a random event, e.g.,~the flip of a fair coin.
|
||||
The respondents reveal to the interviewers only their answer to the question, and keep as a secret the result of the random event (i.e.,~if the coin was tails or heads).
|
||||
Thereafter, the interviewers can account for the probability distribution of the random event, e.g.,~$\frac{1}{2}$ heads and $\frac{1}{2}$ tails, and thus they can roughly eliminate the false responses and estimate the final result of the study.
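For the fair-coin variant sketched above (one of several possible designs), a `Yes' is reported with probability
$$\Pr[\textrm{`Yes'}] = \frac{1}{2}\pi + \frac{1}{2},$$
where $\pi$ is the true proportion of positive answers in the population; hence, from the observed fraction $\hat{y}$ of `Yes' responses, the final result can be estimated as $\hat{\pi} = 2\hat{y} - 1$.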
|
||||
|
||||
Differential privacy mechanisms satisfy two composability properties: \emph{sequential} and \emph{parallel}~\cite{mcsherry2009privacy, soria2016big}.
|
||||
Due to the sequential composability property, the total privacy level of two independent mechanisms $\mathcal{M}_1$ and $\mathcal{M}_2$ over the same data set that satisfy $\varepsilon_1$- and $\varepsilon_2$-differential privacy, respectively, equals $\varepsilon_1 + \varepsilon_2$.
|
||||
The parallel composability property dictates that, when the mechanisms $\mathcal{M}_1$ and $\mathcal{M}_2$ are applied over disjoint subsets of the same data set, then the overall privacy level is $\max_{ i\in\{1,2\}}\varepsilon_i $.
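For instance, answering two count queries over the same data set with $\varepsilon_1 = 0.5$ and $\varepsilon_2 = 0.3$ results in an overall guarantee of $\varepsilon_1 + \varepsilon_2 = 0.8$, whereas answering them over two disjoint partitions of the data set results in $\max(\varepsilon_1, \varepsilon_2) = 0.5$.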
|
||||
Every time a data publisher interacts with (any part of) the original data set, it is mandatory to consume some of the available privacy budget according to the composability properties.
|
||||
This is necessary so as to ensure that there will be no further arbitrary privacy loss when the released data sets are acquired by adversaries (or simple users).
|
||||
However, \emph{post-processing} the output of a differential privacy mechanism can be done without using any additional privacy budget.
|
||||
Naturally, using the same (or different) privacy mechanism(s) multiple times to interact with raw data in combination with already perturbed data, implies that the privacy guarantee of the final output will be calculated according to sequential composition.
|
||||
|
||||
Differential privacy methods are best for low sensitivity queries such as counts, because the presence/\allowbreak absence of a single record can only change the result slightly.
|
||||
However, sum and max queries can be problematic, since a single but very different value could change the output noticeably, making it necessary to add a lot of noise to the query's answer.
|
||||
Furthermore, asking a series of queries may allow the disambiguation between possible data sets, making it necessary to add even more noise to the outputs.
|
||||
For this reason, after a series of queries exhausts the available privacy budget, the data set has to be discarded.
|
||||
To keep the original guarantee across multiple queries that require different/\allowbreak new answers, one must inject noise proportional to the number of executed queries, thus destroying the utility of the output.
|
||||
|
||||
A special category of differential privacy-preserving algorithms is that of \emph{pan-private} algorithms~\cite{dwork2010pan}.
|
||||
Pan-private algorithms hold their privacy guarantees even when snapshots of their internal state (memory) are accessed during their execution by an external entity, e.g.,~through a subpoena, a security breach, etc.
|
||||
There are two intrusion types that a data publisher has to deal with when designing a pan-private mechanism: \emph{single unannounced}, and \emph{continual announced} intrusion.
|
||||
In the first, the data publisher assumes that the mechanism's state is observed by the external entity exactly once, without the data publisher ever being notified about it.
|
||||
In the latter, the external entity gains access to the mechanism's state multiple times, and the publisher is notified after each time.
|
||||
The simplest approach to deal with both cases is to make sure that the data in the memory of the mechanism have constantly the same distribution, i.e.,~they are differentially private.
|
||||
Notice that this must hold throughout the mechanism's lifetime, even before/\allowbreak after it processes any sensitive data point(s).
|
||||
|
||||
The notion of differential privacy has highly influenced the research community, resulting in many follow-up publications (\cite{mcsherry2007mechanism, kifer2011no, zhang2017privbayes} to mention a few).
|
||||
We distinguish here \emph{Pufferfish}~\cite{kifer2014pufferfish} and \emph{geo-indistinguishability}~\cite{andres2013geo,chatzikokolakis2015geo}.
|
||||
\emph{Pufferfish} is a framework that allows experts in an application domain, without necessarily having any particular expertise in privacy, to develop privacy definitions for their data sharing needs.
|
||||
To define a privacy mechanism using \emph{Pufferfish}, one has to define a set of potential secrets $\mathcal{X}$, a set of distinct pairs $\mathcal{X}_{pairs}$, and auxiliary information about data evolution scenarios $\mathcal{B}$.
|
||||
$\mathcal{X}$ serves as an explicit specification of what we would like to protect, e.g.,~`the record of an individual $x$ is (not) in the data'.
|
||||
$\mathcal{X}_{pairs}$ is a subset of $\mathcal{X} \times \mathcal{X}$ that instructs how to protect the potential secrets $\mathcal{X}$, e.g.,~(`$x$ is in the table', `$x$ is not in the table').
|
||||
Finally, $\mathcal{B}$ is a set of conservative assumptions about how the data evolved (or were generated) that reflects the adversary's belief about the data, e.g.,~probability distributions, variable correlations, etc.
|
||||
When there is independence between all the records in the original data set, then $\varepsilon$-differential privacy and the privacy definition of $\varepsilon$-\emph{Pufferfish}$(\mathcal{X}, \mathcal{X}_{pairs}, \mathcal{B})$ are equivalent.
|
||||
\emph{Geo-indistinguishability} is an adaptation of differential privacy for location data in snapshot publishing.
|
||||
It is based on $l$-privacy, which offers individuals within an area of radius $r$ a privacy level of $l$.
|
||||
More specifically, $l$ is equal to $\varepsilon r$ if any two locations within distance $r$ provide data with similar distributions.
|
||||
This similarity depends on $r$ because the closer two locations are, the more likely they are to share the same features.
|
||||
Intuitively, the definition implies that if an adversary learns the published location for an individual, the adversary cannot infer the individual's true location, out of all the points in a radius $r$, with a certainty higher than a factor depending on $l$.
|
||||
The technique adds random noise drawn from a multivariate Laplace distribution to individuals' locations, while taking into account spatial boundaries and features.
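Formally, following~\cite{andres2013geo}, a mechanism $K$ satisfies $\varepsilon$-geo-indistinguishability if, for any two locations $x$, $x'$ and any set of possible reported values $Z$,
$$\Pr[K(x) \in Z] \leq e^{\varepsilon d(x, x')} \Pr[K(x') \in Z],$$
where $d(\cdot, \cdot)$ denotes the Euclidean distance; restricted to locations within a radius $r$, this corresponds to the privacy level $l = \varepsilon r$ mentioned above.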
|
||||
|
||||
\begin{example}
|
||||
\label{ex:application}
|
||||
To illustrate the usage of the microdata and statistical data techniques for privacy-preserving data publishing, we revisit Example~\ref{ex:continuous}.
|
||||
In this example, users continuously interact with an LBS by reporting their status at various locations.
|
||||
Then, the reported data are collected by the central service, in order to be protected and then published, either as a whole, or as statistics thereof.
|
||||
Notice that in order to showcase the straightforward application of $k$-anonymity and differential privacy, we apply the two methods on each timestamp independently from the previous one, and do not take into account any additional threats imposed by continuity.
|
||||
|
||||
\begin{table}
|
||||
\centering\noindent\adjustbox{max width=\linewidth} {
|
||||
\begin{tabular}{@{}ccc@{}}
|
||||
\begin{tabular}{@{}lrll@{}}
|
||||
\toprule
|
||||
\textit{Name} & \multicolumn{1}{c}{Age} & Location & Status \\
|
||||
\midrule
|
||||
* & $> 20$ & Paris & at work \\
|
||||
* & $> 20$ & Paris & driving \\
|
||||
* & $> 20$ & Paris & dining \\
|
||||
\midrule
|
||||
* & $\leq 20$ & Paris & running \\
|
||||
* & $\leq 20$ & Paris & at home \\
|
||||
* & $\leq 20$ & Paris & walking \\
|
||||
\bottomrule
|
||||
\end{tabular} &
|
||||
\begin{tabular}{@{}lrll@{}}
|
||||
\toprule
|
||||
\textit{Name} & \multicolumn{1}{c}{Age} & Location & Status \\
|
||||
\midrule
|
||||
* & $> 20$ & Paris & driving \\
|
||||
* & $> 20$ & Paris & at the mall \\
|
||||
* & $> 20$ & Paris & biking \\
|
||||
\midrule
|
||||
* & $\leq 20$ & Paris & sightseeing \\
|
||||
* & $\leq 20$ & Paris & walking \\
|
||||
* & $\leq 20$ & Paris & at home \\
|
||||
\bottomrule
|
||||
\end{tabular} &
|
||||
\dots \\
|
||||
$t_1$ & $ t_2$ & \\
|
||||
\end{tabular}%
|
||||
}%
|
||||
\caption{3-anonymous event-level protected versions of the microdata in Table~\ref{tab:continuous-micro}.}
|
||||
\label{tab:scenario-micro}
|
||||
\end{table}
|
||||
|
||||
First, we anonymize the data set of Table~\ref{tab:continuous-micro} using $k$-anonymity, with $k = 3$.
|
||||
This means that each user cannot be distinguished from at least $2$ other users.
|
||||
Status is the sensitive attribute, thus the attribute that we wish to protect.
|
||||
We start by suppressing the values of the Name attribute, which is the identifier.
|
||||
The Age and Location attributes are the quasi-identifiers, so we proceed to adequately generalize them.
|
||||
We turn age values to ranges ($\leq 20$, and $> 20$), and generalize location to city level (Paris).
|
||||
Finally, we achieve $3$-anonymity by putting the entries in groups of three, according to the quasi-identifiers.
|
||||
Table~\ref{tab:scenario-micro} depicts the results at each timestamp.
|
||||
|
||||
\begin{table}
|
||||
\centering
|
||||
\subcaptionbox{True counts\label{tab:statistical-true}}{%
|
||||
\begin{tabular}{@{}lr@{}}
|
||||
\toprule
|
||||
Location & \multicolumn{1}{c@{}}{Count} \\
|
||||
\midrule
|
||||
Belleville & $1$ \\
|
||||
Latin Quarter & $1$ \\
|
||||
Le Marais & $1$ \\
|
||||
Montmartre & $2$ \\
|
||||
Opera & $1$ \\
|
||||
\bottomrule
|
||||
\end{tabular}%
|
||||
}\quad
|
||||
\subcaptionbox*{}{%
|
||||
\begin{tabular}{@{}c@{}}
|
||||
\\ \\ \\
|
||||
$\xrightarrow[]{\text{Noise}}$
|
||||
\\ \\ \\
|
||||
\end{tabular}%
|
||||
}\quad
|
||||
\subcaptionbox{Perturbed counts\label{tab:statistical-noisy}}{%
|
||||
\begin{tabular}{@{}lr@{}}
|
||||
\toprule
|
||||
Location & \multicolumn{1}{c@{}}{Count} \\
|
||||
\midrule
|
||||
Belleville & $1$ \\
|
||||
Latin Quarter & $0$ \\
|
||||
Le Marais & $2$ \\
|
||||
Montmartre & $3$ \\
|
||||
Opera & $1$ \\
|
||||
\bottomrule
|
||||
\end{tabular}%
|
||||
}%
|
||||
\caption{(a)~The original version of the data of Table~\ref{tab:continuous-statistical}, and (b)~their $1$-differentially event-level private version.}
|
||||
\label{tab:scenario-statistical}
|
||||
\end{table}
|
||||
|
||||
Next, we demonstrate differential privacy.
|
||||
We apply an $\varepsilon$-differentially private Laplace mechanism, with $\varepsilon = 1$, taking into account the count query that generated the true counts of Table~\ref{tab:continuous-statistical}.
|
||||
The sensitivity of a count query is $1$, since the addition/removal of a tuple from the data set can change the final result of the query by at most $1$.
|
||||
Figure~\ref{fig:laplace} shows what the Laplace distribution for the true count in Montmartre at $t_1$ looks like.
|
||||
Table~\ref{tab:statistical-noisy} shows all the perturbed counts that are going to be released.
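A minimal sketch of how such perturbed counts could be produced is shown below (using numpy's Laplace sampler; the rounding and truncation to non-negative integers are post-processing choices of ours, not part of the mechanism's definition):
\begin{verbatim}
import numpy as np

true_counts = {"Belleville": 1, "Latin Quarter": 1, "Le Marais": 1,
               "Montmartre": 2, "Opera": 1}
epsilon, sensitivity = 1.0, 1.0   # count queries have sensitivity 1

noisy_counts = {
    place: max(0, round(count + np.random.laplace(0.0, sensitivity / epsilon)))
    for place, count in true_counts.items()
}
print(noisy_counts)   # e.g., {'Belleville': 1, 'Latin Quarter': 0, ...}
\end{verbatim}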
|
||||
|
||||
\begin{figure}[htp]
|
||||
\centering
|
||||
\includegraphics[width=.7\linewidth]{laplace}
|
||||
\caption{A Laplace distribution for location $\mu = 2$ and scale $b = 1$.}
|
||||
\label{fig:laplace}
|
||||
\end{figure}
|
||||
|
||||
\end{example}
|
||||
|
||||
|
||||
|
||||
\section{Summary}
|
||||
\label{sec:sum-bg}
|
||||
|
||||
This is the summary of this chapter.
|
25
text/related.tex
Normal file
25
text/related.tex
Normal file
@ -0,0 +1,25 @@
|
||||
\chapter{Related work}
|
||||
\label{ch:rel}
|
||||
|
||||
Since the domain of data privacy is vast, several surveys have already been published with different scopes.
|
||||
A group of surveys focuses on specific families of privacy-preserving algorithms and techniques.
|
||||
For instance, Simi et al.~\cite{simi2017extensive} provide an extensive study of works on $k$-anonymity and Dwork~\cite{dwork2008differential} focuses on differential privacy.
|
||||
Another group of surveys focuses on techniques that allow the execution of data mining or machine learning tasks with some privacy guarantees, e.g.,~Wang et al.~\cite{wang2009survey}, and Ji et al.~\cite{ji2014differential}.
|
||||
In a more general scope, Wang et al.~\cite{wang2010privacy} analyze the challenges of privacy-preserving data publishing, and offer a summary and evaluation of relevant techniques.
|
||||
Additional surveys look into issues around Big Data and user privacy.
|
||||
Indicatively, Jain et al.~\cite{jain2016big}, and Soria-Comas and Domingo-Ferrer~\cite{soria2016big} examine how Big Data conflict with pre-existing concepts of privacy-preserving data management, and how efficiently $k$-anonymity and $\varepsilon$-differential privacy deal with the characteristics of Big Data.
|
||||
Others narrow down their research to location privacy issues.
|
||||
To name a few, Chow and Mokbel~\cite{chow2011trajectory} investigate privacy protection in continuous LBSs and trajectory data publishing, Chatzikokolakis et al.~\cite{chatzikokolakis2017methods} review privacy issues around the usage of LBSs and relevant protection mechanisms and metrics, Primault et al.~\cite{primault2018long} summarize location privacy threats and privacy-preserving mechanisms, and Fiore et al.~\cite{fiore2019privacy} focus only on privacy-preserving publishing of trajectory microdata.
|
||||
Finally, there are some surveys on application-specific privacy challenges.
|
||||
For example, Zhou et al.~\cite{zhou2008brief} have a focus on social networks, and Christin et al.~\cite{christin2011survey} give an outline of how privacy aspects are addressed in crowdsensing applications.
|
||||
Nevertheless, to the best of our knowledge, there is no up-to-date survey that deals with privacy under continuous data publishing covering diverse use cases.
|
||||
Such a survey becomes very useful nowadays, due to the abundance of continuously generated, user-related data sets that could be analyzed and/or published in a privacy-preserving way, and the rapid progress made in this research field.
|
||||
|
||||
\input{micro}
|
||||
\input{statistical}
|
||||
|
||||
|
||||
\section{Summary}
|
||||
\label{sec:sum-rel}
|
||||
|
||||
This is the summary of this chapter.
|
356
text/statistical.tex
Normal file
356
text/statistical.tex
Normal file
@ -0,0 +1,356 @@
|
||||
\section{Statistical data}
|
||||
\label{sec:statistical}
|
||||
|
||||
When continuously publishing statistical data, usually in the form of counts, the most widely used privacy method is differential privacy, or derivatives of it, as witnessed in Table~\ref{tab:statistical}.
|
||||
In theory, differential privacy makes no assumptions about the background knowledge available to the adversary.
|
||||
In practice, as we observe in Table~\ref{tab:statistical}, data dependencies (e.g.,~correlations) arising in the continuous publication setting are frequently (though not always) considered as attacks in the proposed algorithms.
|
||||
|
||||
\includetable{table-statistical}
|
||||
|
||||
|
||||
\subsection{Finite observation}
|
||||
\label{subsec:statistical-finite}
|
||||
|
||||
% Practical differential privacy via grouping and smoothing
|
||||
% - statistical (counts)
|
||||
% their scenario is built on location data (check-ins)
|
||||
% - finite
|
||||
% - batch
|
||||
% - linkage
|
||||
% - event
|
||||
% - differential privacy
|
||||
% - perturbation (Laplace)
|
||||
\hypertarget{kellaris2013practical}{Kellaris et al.}~\cite{kellaris2013practical} pointed out that in time series, where users might contribute to an arbitrary number of aggregates, the sensitivity of the query answering function is significantly influenced by their presence/absence in the data set.
|
||||
Thus, the Laplace perturbation algorithm, commonly used with differential privacy, may produce meaningless data sets.
|
||||
Furthermore, under such settings, the discrete Fourier transformation of the Fourier perturbation algorithm (another popular technique for data perturbation) may behave erratically, and affect the utility of the outcome of the mechanism.
|
||||
For this reason, the authors proposed their own method involving grouping and smoothing for one-time publishing of time series of non-overlapping counts, i.e.,~the aggregated data of one count does not affect any other count.
|
||||
Grouping includes partitioning the data set into similar clusters.
|
||||
The size and the similarity measure of the clusters are data dependent.
|
||||
Random grouping consumes less privacy budget, as it entails minimal interaction with the original data.
|
||||
However, when using a grouping technique based on sampling, which has some privacy cost but produces better groups, the impact of the perturbation is decreased.
|
||||
During the smoothing phase, the average values for each cluster are calculated, and finally, Laplace noise is added to these values.
|
||||
In this way, the query sensitivity becomes less dependent on each individual's data, and therefore less perturbation is required.
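For intuition, the following minimal sketch (our own simplified rendering, assuming fixed-size groups and unit-count sensitivity; the actual method selects the groups in a data-dependent, privacy-aware manner) illustrates the grouping and smoothing idea:
\begin{verbatim}
import numpy as np

def group_and_smooth(counts, group_size, epsilon):
    """Partition a series of counts into fixed-size groups, replace each
    group by its average, and perturb the averages with Laplace noise.
    Averaging over g counts shrinks the contribution of any single count,
    so the noise scale can be divided by g (unit-count sensitivity assumed)."""
    rng = np.random.default_rng()
    counts = np.asarray(counts, dtype=float)
    out = np.empty_like(counts)
    for start in range(0, len(counts), group_size):
        group = counts[start:start + group_size]
        noisy_avg = group.mean() + rng.laplace(scale=1.0 / (len(group) * epsilon))
        out[start:start + group_size] = noisy_avg  # smoothing: one value per group
    return out

print(group_and_smooth([5, 7, 6, 20, 22, 21], group_size=3, epsilon=1.0))
\end{verbatim}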
|
||||
|
||||
% Differentially private sequential data publication via variable-length n-grams
|
||||
% - statistical (trajectories)
|
||||
% - finite
|
||||
% - batch
|
||||
% - linkage
|
||||
% - user
|
||||
% - differential privacy
|
||||
% - perturbation (adaptive Laplace)
|
||||
\hypertarget{chen2012differentially}{Chen et al.}~\cite{chen2012differentially} exploit a text-processing technique, the \emph{n-gram} model, i.e.,~a contiguous sequence of $n$ items from a given data sample, to release sequential data without releasing the noisy statistics (counts) of all of the possible sequences.
|
||||
This model allows publishing the most common $n$-grams (with $n$ typically less than $5$), which suffice to accurately reconstruct the original data set.
|
||||
The privacy technique that the authors propose is suitable for count queries and frequent sequential pattern mining scenarios.
|
||||
In particular, one of the applications that the authors consider concerns sequential spatiotemporal data (i.e.,~trajectories) of individuals.
|
||||
They group grams based on the similarity of their $n$ values, construct a search tree, and inject Laplace noise to each node value (count) to achieve user-level differential privacy protection.
|
||||
Instead of allocating the available privacy budget based on the overall maximum height of the tree, they estimate each path adaptively based on known noisy counts.
|
||||
The grouping process continues until the desired threshold of $n$ is reached.
|
||||
Thereafter, they release variable-length $n$-grams with certain thresholds for the count values and tree heights, balancing the trade-off between shorter grams, which carry less information but suffer less relative error, and longer ones.
|
||||
They use a set of consistency constraints, i.e.,~the sum of the noisy counts of sibling nodes has to be less than or equal to their parent's noisy count, and all the noisy counts of leaf nodes have to be within a predefined threshold.
|
||||
These constraints improve the final data utility since they result in lower values of $n$.
|
||||
On the one hand, this translates into higher counts, large enough to deal with noise injection and the inherent Markov assumption in the $n$-gram model.
|
||||
On the other hand, it enhances privacy when the universe of all grams with a lower $n$ value is relatively small resulting in more common sequences, which, nonetheless, is rarely valid in real-life scenarios.
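A flat, non-adaptive version of the idea can be sketched as follows; this is only our illustration of noisy $n$-gram counting, whereas the actual mechanism builds an exploration tree, allocates the budget adaptively along each path, and enforces the consistency constraints described above:
\begin{verbatim}
from collections import Counter
import numpy as np

def noisy_ngram_counts(sequences, n, epsilon):
    """Count every n-gram appearing in the input sequences and perturb each
    count with Laplace noise.  Removing one sequence of length L can change
    up to L - n + 1 counts, so the noise is calibrated to that sensitivity."""
    rng = np.random.default_rng()
    sensitivity = max(max(len(s) for s in sequences) - n + 1, 1)
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    return {gram: count + rng.laplace(scale=sensitivity / epsilon)
            for gram, count in counts.items()}

trajectories = [["a", "b", "c", "b"], ["b", "c", "b", "a"]]
print(noisy_ngram_counts(trajectories, n=2, epsilon=1.0))
\end{verbatim}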
|
||||
|
||||
% Differentially private publication of general time-serial trajectory data
|
||||
% - statistical (trajectories)
|
||||
% - finite
|
||||
% - batch
|
||||
% - linkage
|
||||
% - user
|
||||
% - differential privacy
|
||||
% - perturbation (exponential, Laplace)
|
||||
\hypertarget{hua2015differentially}{Hua et al.}~\cite{hua2015differentially} use, similar to the scheme proposed in~\cite{chen2012differentially}, the $n$-grams modeling technique for publishing trajectories containing a small number of $n$-grams, thus, sharing few or even no identical prefixes.
|
||||
They propose a differentially private location-specific generalization algorithm (exponential mechanism), where each position in the trajectory is one record.
|
||||
The algorithm probabilistically partitions the locations at each timestamp with probability proportional to their Euclidean distance from each other.
|
||||
They replace each partition with its centroid and therefore, they offer better utility by creating groups of locations belonging to close trajectories.
|
||||
They optimize the algorithm for time efficiency by using classic $k$-means clustering.
|
||||
Then, the algorithm releases the new trajectories by observing the generalized location partitions, and their perturbed counts (i.e.,~sum of the same locations at each timestamp) with noise drawn from a Laplace distribution.
|
||||
The process continues until the total count of the published trajectories reaches the size of the original data set.
|
||||
They can limit the total number of the possible trajectories by taking into account the individual's moving speed.
|
||||
The authors evaluated the utility of distorted spatiotemporal range queries via the Hausdorff distance from the original results, and concluded that the utility deterioration is within reasonable boundaries considering the offered privacy guarantees.
|
||||
Similar to~\cite{chen2012differentially}, their approach works well for a small location domain.
|
||||
To make it applicable to realistic scenarios, it is essential to truncate the original trajectories in an effort to reduce the location domain.
|
||||
This results in a coarse discretization of the location area, leading to the arbitrary distortion of the spatial correlations that are present in the original data set.
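The per-timestamp generalization step can be sketched as follows (our own illustration using scikit-learn's plain $k$-means, i.e.,~the time-efficient variant mentioned above; the authors' algorithm performs the partitioning itself via the exponential mechanism):
\begin{verbatim}
import numpy as np
from sklearn.cluster import KMeans

def generalize_timestamp(points, k, epsilon):
    """Cluster the locations reported at one timestamp, map every location
    to its cluster centroid, and release each centroid together with a
    Laplace-perturbed count of the locations it replaces."""
    rng = np.random.default_rng()
    points = np.asarray(points, dtype=float)
    km = KMeans(n_clusters=k, n_init=10).fit(points)
    released = []
    for c in range(k):
        count = int(np.sum(km.labels_ == c))
        noisy_count = max(round(count + rng.laplace(scale=1.0 / epsilon)), 0)
        released.append((km.cluster_centers_[c], noisy_count))
    return released

points = [(0.1, 0.2), (0.2, 0.1), (5.0, 5.1), (5.2, 4.9)]
print(generalize_timestamp(points, k=2, epsilon=1.0))
\end{verbatim}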
|
||||
|
||||
% Achieving differential privacy of trajectory data publishing in participatory sensing
|
||||
% - statistical (trajectories)
|
||||
% - finite
|
||||
% - batch
|
||||
% - linkage
|
||||
% - user
|
||||
% - differential privacy
|
||||
% - perturbation (Laplace)
|
||||
\hypertarget{li2017achieving}{Li et al.}~\cite{li2017achieving} focus on publishing a set of trajectories, where, contrary to~\cite{hua2015differentially}, each one is considered as a single entry in the data set.
|
||||
First, using $k$-means clustering they partition the original locations based on their pairwise Euclidean distances.
|
||||
The scheme represents each location partition by its mean (centroid).
|
||||
A larger number of partitions, in areas where close centroids exist, results in fewer locations in each partition, and thus lower trajectory precision loss.
|
||||
Before adding noise, they randomly select partition centroids to generate trajectories until they reach the size of the original data set.
|
||||
Then, they generate Laplace noise, which they bound according to a set of constraints, and they add it to the count of locations of each point of every trajectory.
|
||||
Finally, they release the generalized trajectories along with the noisy count of each location point.
|
||||
The authors prove experimentally that they reduce considerably the trajectory merging time at the expense of utility.
|
||||
|
||||
% DPT: differentially private trajectory synthesis using hierarchical reference systems
|
||||
% - statistical (trajectories)
|
||||
% - finite
|
||||
% - batch
|
||||
% - dependence
|
||||
% - user
|
||||
% - differential privacy
|
||||
% - perturbation (Laplace)
|
||||
% - spatial correlations (Hierarchical Reference Systems (HRS))
|
||||
\hypertarget{he2015dpt}{He et al.} present \emph{DPT} (Differentially Private Trajectory)~\cite{he2015dpt}, a system that synthesizes mobility data based on raw, speed-varying trajectories of individuals, while providing $\varepsilon$-differential privacy protection guarantees.
|
||||
The system constructs a Hierarchical Reference Systems (HRS) model to capture correlations between adjacent locations by imposing a uniform grid at multiple resolutions (i.e.,~for different speed values) over the space, keeping a prefix tree for each resolution, and choosing the centroids as anchor points.
|
||||
In each reference system, anchor points have a small number of neighboring points with increasing (by a constant factor) average distance between them, and fewer children anchor points as the grid resolution becomes finer.
|
||||
DPT estimates transition probabilities only for the anchor points in proximity to the last observed location, and chooses the appropriate reference system for each raw point so that the consecutive points of the trajectory are either neighboring anchors or have a parent-child relationship.
|
||||
The system generates the transition probabilities by estimating the counts in the prefix trees.
|
||||
Thereafter, it chooses the appropriate prefix trees, perturbs them with noise drawn from the Laplace distribution, and adaptively prunes subtrees with low counts to improve the resulting utility.
|
||||
DPT implements a direction-weighted sampling postprocessing strategy for the synthetic trajectories to avoid the loss of directionality of the original trajectories due to the perturbation.
|
||||
Nonetheless, as with all other similar techniques, the usage of prefix trees limits the length of the released trajectories, which results in an uneven spatial distribution.
|
||||
|
||||
% Pufferfish Privacy Mechanisms for Correlated Data
|
||||
% - statistical
|
||||
% - finite
|
||||
% - batch
|
||||
% - dependence
|
||||
% - unspecified
|
||||
% - \emph{Pufferfish}
|
||||
% - perturbation (Laplace)
|
||||
% - general (Bayesian networks/Markov chains)
|
||||
\hypertarget{song2017pufferfish}{Song et al.}~\cite{song2017pufferfish} propose the \emph{Wasserstein mechanism}, a technique that applies to any general instantiation of Pufferfish (see Section~\ref{subsec:privacy-statistical}).
|
||||
It adds noise proportional to the sensitivity of a query $F$, which depends on the worst case distance between the distributions $P(F(X)|s_i,d)$ and $P(F(X)|s_j,d)$ for a variable $X$, a pair of secrets $(s_i,s_j)$, and an evolution scenario $d$.
|
||||
The Wasserstein metric function calculates the worst case distance between those two distributions.
|
||||
The noise is drawn from a Laplace distribution with parameter equal to the quotient resulting from the division of the maximum Wasserstein distance of the distributions of all the pairs of secrets by the available privacy budget $\varepsilon$.
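In our notation, and assuming (as we understand the mechanism) the $\infty$-Wasserstein distance $W_\infty$, the released answer to a query $F$ over the data $X$ can be sketched as
\[
F(X) + \mathrm{Lap}\!\left(\frac{\max_{(s_i, s_j),\, d} W_\infty\big(P(F(X) \mid s_i, d),\ P(F(X) \mid s_j, d)\big)}{\varepsilon}\right),
\]
where the maximum ranges over all pairs of secrets and all evolution scenarios admitted by the Pufferfish instantiation.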
|
||||
For optimization purposes, the authors consider a more restricted setting.
|
||||
This setting utilizes an evolution scenario to represent the data correlations, and Bayesian networks to model them.
|
||||
The authors state that, in cases where the Bayesian networks are complex, Markov chains are a more efficient alternative.
|
||||
A generalization of the \emph{Markov blanket} mechanism, the \emph{Markov quilt} mechanism, calculates data dependencies.
|
||||
The dependent nodes of any node consist of its parents, its children, and the other parents of its children.
|
||||
The present technique excels at data sets generated by monitoring applications or networks, but it is not suitable for online scenarios.
|
||||
|
||||
% Differentially private multi-dimensional time series release for traffic monitoring
|
||||
% - statistical (location)
|
||||
% - finite
|
||||
% - streaming
|
||||
% - dependence
|
||||
% - user
|
||||
% - differential privacy
|
||||
% - perturbation (Laplace)
|
||||
% - spatiotemporal/serial correlations
|
||||
\hypertarget{fan2013differentially}{Fan et al.}~\cite{fan2013differentially} propose a real-time framework for releasing differentially private multi-dimensional traffic monitoring data.
|
||||
At every timestamp, the Perturbation module injects noise drawn from a Laplace distribution into the data.
|
||||
Then, the Estimation module post-processes the perturbed data to improve the accuracy.
|
||||
The authors propose a temporal and a spatial estimation algorithm.
|
||||
The former estimates an internal time series model for each location to improve the utility of the perturbation's outcome by performing a posterior estimation that utilizes Gaussian approximation and Kalman filtering~\cite{kalman1960new}.
|
||||
The latter reduces data sparsity by grouping neighboring locations using a spatial indexing structure based on a quadtree.
|
||||
The Modeling/Aggregation module utilizes domain knowledge, e.g.,~road network and density, and has a bidirectional interaction with the other two in parallel.
|
||||
Although the authors propose the framework for real-time scenarios, they do not deal with infinite data processing/publication, which considerably limits its applicability.
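As a rough, single-location sketch of the interplay between the Perturbation and Estimation modules (our own simplification; the diffuse prior is an assumption, the Laplace noise is approximated by a Gaussian of equal variance, and the spatial estimation is omitted):
\begin{verbatim}
import numpy as np

def perturb_and_estimate(series, epsilon, process_var=1.0):
    """Laplace perturbation at every timestamp, followed by a scalar Kalman
    filter used as posterior estimation of the true count."""
    rng = np.random.default_rng()
    meas_var = 2.0 / epsilon ** 2             # variance of Laplace(1/epsilon) noise
    estimate, est_var = 0.0, 1e6              # diffuse prior
    released = []
    for value in series:
        noisy = value + rng.laplace(scale=1.0 / epsilon)  # Perturbation module
        pred, pred_var = estimate, est_var + process_var  # Kalman predict
        gain = pred_var / (pred_var + meas_var)           # Kalman gain
        estimate = pred + gain * (noisy - pred)           # posterior mean
        est_var = (1 - gain) * pred_var
        released.append(estimate)                         # Estimation module output
    return released

print(perturb_and_estimate([10, 12, 11, 30, 31, 29], epsilon=1.0))
\end{verbatim}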
|
||||
|
||||
% An Adaptive Approach to Real-Time Aggregate Monitoring With Differential Privacy
|
||||
% - statistical
|
||||
% - finite
|
||||
% - streaming
|
||||
% - linkage
|
||||
% - user
|
||||
% - differential privacy
|
||||
% - perturbation (dynamic Laplace)
|
||||
In another work, \hypertarget{fan2014adaptive}{Fan et al.} designed \emph{FAST}~\cite{fan2014adaptive}, an adaptive system that allows the release of real-time aggregate time series under user-level differential privacy.
|
||||
This is achieved by using a Sampling, a Perturbation, and a Filtering module.
|
||||
The Sampling module samples, at an adaptive rate, the aggregates to be perturbed.
|
||||
The Perturbation module adds noise to each sampled point according to the allocated privacy budget.
|
||||
The Filtering module receives the perturbed data point and the original one and generates a posterior estimate, which is finally released.
|
||||
The error between the perturbed and the released (posterior estimate) point is used to adapt the sampling rate; the sampling frequency is increased when the data undergo rapid changes, and decreased otherwise.
|
||||
Thus, depending on the adjusted sampling rate, not every single data point is perturbed, saving in this way the available privacy budget.
|
||||
While the system considers the temporal correlations of the processed time series, it does not attempt to deal with the privacy threat that they might pose.
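A much-simplified sketch of the sampling/perturbation/filtering loop follows (our own illustration; the threshold-based interval adaptation is a crude stand-in for the error-driven control of the original system, and the filtering step is reduced to keeping the last noisy sample):
\begin{verbatim}
import numpy as np

def fast_like(series, total_budget, max_samples, error_threshold=5.0):
    """Only sampled points are perturbed (each spending an equal share of the
    budget); non-sampled points are approximated by the last estimate, and the
    sampling interval shrinks when the observed error grows."""
    rng = np.random.default_rng()
    eps = total_budget / max_samples
    interval, next_sample, samples_used = 1, 0, 0
    estimate, released = 0.0, []
    for t, value in enumerate(series):
        if t == next_sample and samples_used < max_samples:
            noisy = value + rng.laplace(scale=1.0 / eps)  # Perturbation module
            error = abs(noisy - estimate)                 # feedback error
            estimate = noisy                              # naive filtering step
            samples_used += 1
            # large error: sample sooner; small error: sample later
            interval = 1 if error > error_threshold else interval + 1
            next_sample = t + interval
        released.append(estimate)                         # released estimate
    return released

print(fast_like([3, 3, 4, 20, 21, 22, 22, 5, 4], total_budget=1.0, max_samples=5))
\end{verbatim}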
|
||||
|
||||
% CTS-DP: publishing correlated time-series data via differential privacy}
|
||||
% - statistical (they use trajectories in the experiments)
|
||||
% - finite
|
||||
% - streaming
|
||||
% - dependence
|
||||
% - event
|
||||
% - differential privacy
|
||||
% - perturbation (correlated Laplace)
|
||||
% - serial correlations (autocorrelation function)
|
||||
\hypertarget{wang2017cts}{Wang and Xu}~\cite{wang2017cts} defined Correlated Time Series Differential Privacy (\emph{CTS-DP}).
|
||||
The scheme guarantees that the correlation between the perturbation that is introduced by a Correlated Laplace Mechanism (CLM), and the original time series is indistinguishable (Series-Indistinguishability).
|
||||
CTS-DP deals with the shortcomings of independent and identically distributed (i.i.d.) noise under the presence of correlations.
|
||||
I.i.d. noise offers inadequate protection, because refinement methods, e.g.,~filtering, can remove it.
|
||||
Most privacy-preserving methods choose to introduce more noise in the presence of strong correlations, thus diminishing the data utility.
|
||||
An original and a perturbed time series satisfy Series-Indistinguishability if their normalized autocorrelation functions are the same; hence, the two time series are indistinguishable and the published time series satisfies differential privacy as well.
|
||||
The authors consider the fact that, in signal processing, if an i.i.d. signal passes through a filter, which consists of a combination of adders and delayers, it becomes non-i.i.d.
|
||||
Hence, they design CLM, which uses four Gaussian white noise series passed through a linear system, to produce a correlated Laplace noise series according to the autocorrelation function of the original time series.
|
||||
Although the authors prove experimentally that the implementation of CLM outperforms the current state-of-the-art methods, they do not test its robustness against any filter, which they keep as future work.
|
||||
|
||||
|
||||
\subsection{Infinite observation}
|
||||
\label{subsec:statistical-infinite}
|
||||
|
||||
% Private and continual release of statistics
|
||||
% - statistical
|
||||
% - infinite
|
||||
% - streaming
|
||||
% - linkage
|
||||
% - event
|
||||
% - differential privacy
|
||||
% - perturbation (Laplace)
|
||||
\hypertarget{chan2011private}{Chan et al.}~\cite{chan2011private} designed continuous counting mechanisms for finite and infinite data processing and publishing, satisfying $\varepsilon$-differential privacy.
|
||||
Their main contribution lies in proposing the Binary and Hybrid mechanisms, which impose no temporal upper-bound requirements.
|
||||
The mechanisms rely on the release of intermediate partial sums of counts at consecutive timestamp intervals, called \emph{p-sums}, and the injection of noise drawn from a Laplace distribution.
|
||||
The Binary mechanism constructs a binary tree where each node corresponds to a p-sum, and adds noise to each released p-sum proportional to its corresponding length.
|
||||
The Hybrid mechanism publishes counts at sparse time intervals, i.e.,~timestamps that are a power of $2$.
|
||||
Both mechanisms offer event-level protection (pan-privacy) under a single unannounced intrusion and under continual announced intrusions, by adding a certain amount of noise to every p-sum in memory.
|
||||
They can facilitate continual top-$k$ queries in recommendation systems, and multidimensional range queries.
|
||||
Furthermore, they are able to support applications that require a consistent output, i.e.,~at each timestamp the counter increases by either $0$ or $1$.
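The Binary mechanism can be sketched as follows (our own rendering for a known horizon $T$ and a unit-count stream; the budget split across tree levels is simplified):
\begin{verbatim}
import numpy as np

def binary_mechanism(stream, epsilon):
    """Continual counting with the Binary mechanism: the running count at time t
    is assembled from the noisy p-sums sitting on the set bits of t, and each
    p-sum receives Laplace noise whose scale grows with the number of tree
    levels (roughly log2(T) + 1), since the budget is split among them."""
    rng = np.random.default_rng()
    T = len(stream)
    levels = int(np.ceil(np.log2(T))) + 1 if T > 1 else 1
    scale = levels / epsilon
    psum = [0.0] * levels       # exact p-sums, one slot per tree level
    noisy = [0.0] * levels      # their noisy counterparts
    outputs = []
    for t, x in enumerate(stream, start=1):   # x is 0 or 1
        i = (t & -t).bit_length() - 1         # index of the lowest set bit of t
        psum[i] = sum(psum[:i]) + x           # merge the lower levels into level i
        noisy[i] = psum[i] + rng.laplace(scale=scale)
        for j in range(i):                    # the merged levels are now empty
            psum[j] = noisy[j] = 0.0
        outputs.append(sum(noisy[j] for j in range(levels) if t >> j & 1))
    return outputs

print(binary_mechanism([1, 0, 1, 1, 0, 1, 1, 1], epsilon=1.0))
\end{verbatim}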
|
||||
|
||||
% Differentially private real-time data release over infinite trajectory streams
|
||||
% - statistical (spatial)
|
||||
% - infinite
|
||||
% - streaming
|
||||
% - linkage
|
||||
% - personalized w-event
|
||||
% - differential privacy
|
||||
% - perturbation (dynamic Laplace)
|
||||
\hypertarget{cao2015differentially}{Cao et al.}~\cite{cao2015differentially} developed a framework that achieves personalized \emph{l-trajectory} privacy protection by dynamically adding noise at each timestamp, which exponentially fades over time.
|
||||
Each individual can specify, in an array of size $l$, the desired protection level for each location of his/her trajectory.
|
||||
The proposed framework is composed of three components.
|
||||
The Dynamic Budget Allocation component allocates portions of the privacy budget to the other two components: a fixed one to the Private Approximation, and a dynamic one to the Private Publishing component at each timestamp.
|
||||
The Private Approximation component estimates, under a utility goal and an approximation strategy, whether it is beneficial to publish approximate data or not.
|
||||
More precisely, it chooses an appropriate previous noisy data release and republishes it if it is similar to the real statistics planned to be published.
|
||||
The Private Publishing component takes as inputs the real statistics, and the timestamp of the approximate data, generated by the Private Approximation component, to be republished.
|
||||
If the timestamp of the approximate data is equal to the current timestamp, then the current data with Laplace noise are published.
|
||||
Otherwise, the data at the corresponding timestamp of the approximate data will be republished.
|
||||
The utilized approximation technique is highly suitable for stream processing, since it can significantly reduce the privacy budget consumption.
|
||||
However, the framework does not take into account privacy leakage stemming from data dependencies, which considerably limits its applicability to real-life data sets.
|
||||
|
||||
% Private decayed predicate sums on streams
|
||||
% - statistical
|
||||
% - infinite
|
||||
% - streaming
|
||||
% - linkage
|
||||
% - w-event
|
||||
% - differential privacy
|
||||
% - perturbation (Laplace)
|
||||
\hypertarget{bolot2013private}{Bolot et al.}~\cite{bolot2013private} introduce the notion of \emph{decayed privacy} in continual observation of aggregates (sums).
|
||||
The authors recognize the fact that monitoring applications focus more on recent events and data; therefore, the value of previous data releases fades exponentially.
|
||||
This leads to a scheme of privacy with expiration, according to which recent events and data are more privacy-sensitive than older ones.
|
||||
Based on this, they apply decayed sum functions for answering sliding window queries of fixed window size $w$ on data streams.
|
||||
Namely, the window sum computes the difference of two running sums, while the exponentially decayed and polynomially decayed sums estimate the sum of decayed data.
|
||||
For every $w$ consecutive data points, the algorithm generates binary trees where each node is perturbed with Laplace noise with scale proportional to $w$.
|
||||
Instead of maintaining a binary tree for every window, the algorithm considers the windows that span two blocks as the union of a suffix and a prefix of two consecutive trees.
|
||||
This way, the global sensitivity of the query function is kept low.
|
||||
The proposed techniques are designed for fixed window sizes, hence, when answering multiple sliding window queries with variable window sizes they have to distribute the available privacy budget accordingly.
|
||||
|
||||
% Differentially private event sequences over infinite streams
|
||||
% - statistical
|
||||
% - infinite
|
||||
% - streaming
|
||||
% - linkage
|
||||
% - w-event
|
||||
% - differential privacy
|
||||
% - perturbation (Laplace)
|
||||
Based on the notion of decayed privacy~\cite{bolot2013private}, \hypertarget{kellaris2014differentially}{Kellaris et al.}~\cite{kellaris2014differentially} defined $w$-event privacy in the setting of periodical release of statistics (counts) in infinite streams.
|
||||
To achieve $w$-event privacy, the authors propose two mechanisms (Budget Distribution, and Budget Absorption) based on sliding windows, which effectively distribute the privacy budget to sub-mechanisms (one sub-mechanism per timestamp) applied on the data of a window of the stream.
|
||||
Both algorithms may decide to publish a new noisy count for a specific timestamp, based on the similarity level of the current count with a previously published one.
|
||||
Moreover, both algorithms have the constraint that the total privacy budget consumed in a window is less than or equal to $\varepsilon$.
|
||||
The Budget Distribution algorithm distributes the privacy budget in an exponential-fading manner following the assumption that in a window most of the counts remain similar.
|
||||
The budget of expired timestamps becomes available for the next publications (of next windows).
|
||||
The Budget Absorption algorithm uniformly distributes the budget to the window's timestamps from the beginning.
|
||||
A publication uses not only the by-default allocated budget but also the budget of non-published timestamps.
|
||||
In order not to exceed the limit of $\varepsilon$, an adequate number of subsequent timestamps is `silenced' after a publication takes place.
|
||||
Even though one can argue that $w$-event privacy could be achieved by user-level privacy, the latter is impractical because of the rigidity of the budget allocation, which would finally render the output useless.
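The publish-or-approximate logic with exponential-fading allocation can be sketched as below (our own simplification: the similarity test here is performed on the raw counts without spending budget, whereas the actual mechanisms also perturb this test and reclaim the budget of expired timestamps):
\begin{verbatim}
import numpy as np

def budget_distribution_like(stream, w, epsilon, threshold):
    """At each timestamp either republish the previous noisy count (no budget
    spent) or publish a fresh Laplace-perturbed count using half of the budget
    still unused inside the current window of w timestamps."""
    rng = np.random.default_rng()
    spent = [0.0] * len(stream)               # budget spent per timestamp
    last_release, releases = None, []
    for t, count in enumerate(stream):
        window_spent = sum(spent[max(0, t - w + 1):t])
        remaining = epsilon - window_spent
        if last_release is not None and abs(count - last_release) < threshold:
            releases.append(last_release)     # approximate: reuse old release
        else:
            eps_t = remaining / 2             # exponential-fading share
            spent[t] = eps_t
            last_release = count + rng.laplace(scale=1.0 / eps_t)
            releases.append(last_release)
    return releases

print(budget_distribution_like([10, 10, 11, 30, 31, 30, 10],
                               w=3, epsilon=1.0, threshold=3))
\end{verbatim}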
|
||||
|
||||
% RescueDP: Real-time spatio-temporal crowd-sourced data publishing with differential privacy
|
||||
% - statistical (spatial)
|
||||
% - infinite
|
||||
% - streaming
|
||||
% - linkage
|
||||
% - w-event
|
||||
% - differential privacy
|
||||
% - perturbation (dynamic Laplace)
|
||||
% - serial correlations (Pearson's r)
|
||||
\hypertarget{wang2016rescuedp}{Wang et al.}~\cite{wang2016rescuedp} propose \emph{RescueDP} for the publishing of real-time user-generated spatiotemporal data, utilizing differential privacy with $w$-event-level protection.
|
||||
RescueDP uses a Dynamic Grouping module to create clusters of regions with small statistics, i.e.,~areas with a small number of samples.
|
||||
It estimates the similarity of the data trends of these regions by utilizing Pearson's correlation coefficient, and creates groups accordingly.
|
||||
The data of each group pass through a Perturbation module that injects Laplace noise into them.
|
||||
The grouping of the previous phase results in an increase of the sample size of each group of regions, which minimizes the error due to the noise injection.
|
||||
The implementation of a Kalman Filtering~\cite{kalman1960new} module further increases the utility of the released data.
|
||||
A Budget Allocation module distributes the available privacy budget to sampling points within any successive $w$ timestamps.
|
||||
RescueDP saves part of the available privacy budget by approximating the non-sampled data with previously released perturbed data.
|
||||
During the whole process, an Adaptive Sampling module adjusts the sampling interval according to the difference in the released data statistics over the previous timestamps while taking into account the remaining privacy budget.
|
||||
|
||||
% RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response
|
||||
% - statistical
|
||||
% - infinite
|
||||
% - streaming
|
||||
% - linkage
|
||||
% - user
|
||||
% - differential privacy
|
||||
% - randomization (randomized response)
|
||||
% - local
|
||||
\hypertarget{erlingsson2014rappor}{Erlingsson et al.}~\cite{erlingsson2014rappor} presented \emph{RAPPOR} (Randomized Aggregatable Privacy-Preserving Ordinal Response) as a solution for privacy-preserving collection of statistics.
|
||||
RAPPOR makes all the necessary data processing on the side of the data generators by applying the method of randomized response, which guarantees local differential privacy.
|
||||
The product of each local privacy-preserving processing is a report that can be represented as a bit string.
|
||||
Each bit corresponds to a randomized response to a logical predicate on an individual's personal data, e.g.,~categorical properties, numerical and ordinal values, or categories that cannot be enumerated.
|
||||
Initially, RAPPOR hashes a sensitive value into a Bloom filter~\cite{bloom1970space}.
|
||||
It creates a binary reporting value, which it keeps in its memory (\emph{memoization}) and reuses for future reports (permanent randomized response).
|
||||
Memoization offers long-term longitudinal privacy protection for privacy-sensitive data values that do not change over time or that are not dependent.
|
||||
RAPPOR deals with tracking externalities by reporting a randomized version of the permanent randomized response (instantaneous randomized response).
|
||||
Although this adds an extra layer of randomization to the reported values, it might lead to an averaging attack that may allow an adversary to estimate the true value.
|
||||
Finally, the authors propose a decoding technique that involves grouping, least-squares solving, and regression.
|
||||
This way, they effectively make up for the loss of information due to the randomization of the previous steps and allow the extraction of useful information when observing the generated bit strings.
|
||||
They test their implementation with both simulated and real data, and show that they can extract statistics with high utility while preserving the privacy of the individuals involved.
|
||||
However, the fact that the privacy guarantees of their technique hold only for stationary individuals producing independent data, on top of the relatively complex configuration, renders their proposal impractical for many real-world scenarios.
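A compact sketch of the two randomization layers follows (our own illustration with assumed parameter values; in the full scheme the permanent randomized response is memoized per reported value and the Bloom filter parameters are part of the configuration):
\begin{verbatim}
import hashlib
import numpy as np

def bloom_bits(value, num_bits=64, num_hashes=2):
    """Hash a string value into a Bloom filter of num_bits bits."""
    bits = np.zeros(num_bits, dtype=int)
    for h in range(num_hashes):
        digest = hashlib.sha256(f"{h}:{value}".encode()).hexdigest()
        bits[int(digest, 16) % num_bits] = 1
    return bits

def rappor_report(value, f=0.5, p=0.5, q=0.75):
    """Two randomization layers: a permanent randomized response B' (memoized
    per value in the full scheme) and, on top of it, an instantaneous
    randomized response sent as the actual report."""
    rng = np.random.default_rng()
    b = bloom_bits(value)
    # permanent layer: each bit is set with prob. f/2, cleared with prob. f/2,
    # and kept as in the Bloom filter with prob. 1 - f
    r = rng.random(b.shape)
    b_perm = np.where(r < f / 2, 1, np.where(r < f, 0, b))
    # instantaneous layer: report a bit as 1 with prob. q if B'=1, else p
    return (rng.random(b.shape) < np.where(b_perm == 1, q, p)).astype(int)

print(rappor_report("example.com"))
\end{verbatim}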
|
||||
|
||||
% PrivApprox: privacy-preserving stream analytics
|
||||
% - statistical
|
||||
% - infinite
|
||||
% - streaming
|
||||
% - linkage
|
||||
% - event
|
||||
% - zero-knowledge
|
||||
% - perturbation (randomized response)
|
||||
\hypertarget{quoc2017privapprox}{Le Quoc et al.}~\cite{quoc2017privapprox} propose \emph{PrivApprox}, a data analytics system for privacy-preserving stream processing of distributed data sets that combines sampling and randomized response.
|
||||
The system distributes the analysts' queries to clients via an aggregator and proxies, and employs sliding window computations over batched stream processing to handle the data stream generated by the clients.
|
||||
The clients transmit a randomized response, after sampling the locally available data, to the aggregator via proxies that apply (XOR-based) encryption.
|
||||
The combination of sampling and randomized response achieves \emph{zero-knowledge} based privacy, i.e.,~proving that they know a piece of information without in fact disclosing its actual value.
|
||||
The aggregator collects the received responses and returns statistics to the analysts.
|
||||
The query model expresses the responses of numerical queries as counts within histogram buckets, whereas, for non-numeric queries it specifies each bucket by a matching rule or a regular expression.
|
||||
A confidence metric quantifies the results' approximation from the sampling and randomization.
|
||||
PrivApprox achieves low-latency stream processing and enables a synchronization-free distributed architecture that requires little trust in a central entity.
|
||||
Since it implements a sliding window methodology for infinitely processing series of data sets, it would be purposeful to investigate how to achieve $w$-event-level privacy protection.
|
||||
|
||||
% Hiding in the crowd: Privacy preservation on evolving streams through correlation tracking
|
||||
% - statistical
|
||||
% - infinite
|
||||
% - streaming
|
||||
% - data dependence
|
||||
% - event
|
||||
% - randomization
|
||||
% - perturbation (dynamic)
|
||||
% - serial correlations (data trends)
|
||||
\hypertarget{li2007hiding}{Li et al.}~\cite{li2007hiding} attempt to tackle the problem of privacy preservation in numerical data streams taking into account the correlations that may appear continuously among multiple streams and within each one of them.
|
||||
Firstly, the authors define the utility, and privacy specifications.
|
||||
The utility of a perturbed data stream is the inverse of the discrepancy between the original and the perturbed measurements.
|
||||
The discrepancy is set as the normalized Frobenius norm, i.e.,~a matrix norm defined as the square root of the sum of the absolute squares of its elements.
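In symbols, for a matrix $A = (a_{ij})$,
\[
\|A\|_F = \sqrt{\sum_i \sum_j |a_{ij}|^2},
\]
and the discrepancy between the original streams $X$ and their perturbed version $\widetilde{X}$ is proportional to $\|X - \widetilde{X}\|_F$ (the exact normalization factor depends on the stream dimensions).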
|
||||
Privacy corresponds to the discrepancy between the original and the reconstructed data stream (from the perturbed one), and consists of the removed noise and the error introduced by the reconstruction.
|
||||
Then, correlations come into play.
|
||||
The system continuously monitors the data streams for trends to track correlations, and dynamically perturbs the original numerical data while maintaining the trends that are present.
|
||||
More specifically, the Streaming Correlated Additive Noise (SCAN) module updates the estimation of the local principal components of the original data, and proportionally distributes noise along the components. Thereafter, the Streaming Correlation Online Reconstruction (SCOR) module removes all the noise by utilizing the best linear reconstruction.
|
||||
SCOR is a representation of the ability of any adversarial entity to post-process the released data and attempt to reconstruct the original data set by filtering out any distortion.
|
||||
Overall, the present technique offers robustness against inference attacks by adapting randomization according to data trends, but fails to efficiently quantify the overall privacy guarantee.
|
||||
|
||||
% PeGaSus: Data-Adaptive Differentially Private Stream Processing
|
||||
% - statistical
|
||||
% - infinite
|
||||
% - streaming
|
||||
% - linkage
|
||||
% - event
|
||||
% - differential privacy
|
||||
% - perturbation (Laplace)
|
||||
\hypertarget{chen2017pegasus}{Chen et al.}~\cite{chen2017pegasus} developed \emph{PeGaSus}, an algorithm for event-level differentially private stream processing that supports different categories of stream queries (counts, sliding window, and event monitoring) over multiple stream resolutions.
|
||||
It consists of three modules: a Perturber, a Grouper, and a Smoother.
|
||||
The Perturber consumes the incoming data stream, adds noise to each data item using a portion $\varepsilon_p$ of the available privacy budget $\varepsilon$, and outputs a stream of noisy data.
|
||||
The data-adaptive Grouper consumes the original stream and partitions the data into well-approximated regions, using another portion $\varepsilon_g$ of the available privacy budget.
|
||||
Finally, a query specific Smoother combines the independent information produced by the Perturber and the Grouper, and performs post-processing by calculating the final estimates of the Perturber's values for each partition created by the Grouper at each timestamp.
|
||||
The combination of the Perturber and the Grouper follows the sequential composition and post-processing properties of differential privacy, thus, the resulting algorithm satisfies ($\varepsilon_p + \varepsilon_g$)-differential privacy.
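In miniature, and ignoring the private grouping (the sketch below groups on the raw counts for brevity, whereas the actual Grouper spends $\varepsilon_g$ to do so privately), the pipeline can be illustrated as:
\begin{verbatim}
import numpy as np

def pegasus_like(stream, eps_p, deviation_threshold):
    """Perturber + Grouper + Smoother in miniature: every count receives
    Laplace(1/eps_p) noise; the (non-private, for brevity) grouper starts a new
    partition when a count deviates too much from the first count of the current
    partition; the smoother replaces each noisy count by the mean of the noisy
    counts in its partition."""
    rng = np.random.default_rng()
    noisy = [c + rng.laplace(scale=1.0 / eps_p) for c in stream]   # Perturber
    partitions, current = [], [0]
    for i in range(1, len(stream)):                                # Grouper
        if abs(stream[i] - stream[current[0]]) <= deviation_threshold:
            current.append(i)
        else:
            partitions.append(current)
            current = [i]
    partitions.append(current)
    smoothed = [0.0] * len(stream)                                 # Smoother
    for part in partitions:
        mean = float(np.mean([noisy[i] for i in part]))
        for i in part:
            smoothed[i] = mean
    return smoothed

print(pegasus_like([5, 6, 5, 20, 21, 22, 5], eps_p=1.0, deviation_threshold=3))
\end{verbatim}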
|
51
text/titlepage.tex
Normal file
51
text/titlepage.tex
Normal file
@ -0,0 +1,51 @@
|
||||
\begin{titlepage}
|
||||
|
||||
\centering
|
||||
|
||||
{\huge \thetitle\\}
|
||||
\vspace{6em}
|
||||
|
||||
{PRÉSENTÉE LE ** *** ****\\}
|
||||
\vspace{1em}
|
||||
|
||||
{À LA CY TECH - SCIENCES ET TECHNIQUES\\}
|
||||
{EQUIPES TRAITEMENT DE L'INFORMATION ET SYSTÈMES (ETIS)\\}
|
||||
{PROGRAMME DOCTORAL EN SCIENCES ET TECHNOLOGIES DE L'INFORMATION ET DE LA COMMUNICATION (STIC)\\}
|
||||
\vspace{2em}
|
||||
|
||||
{\Large CY CERGY PARIS UNIVERSITÉ\\}
|
||||
\vspace{2em}
|
||||
|
||||
{POUR L’OBTENTION DU GRADE DE DOCTEUR ÈS SCIENCES\\}
|
||||
\vspace{4em}
|
||||
|
||||
{\Large PAR\\}
|
||||
\vspace{2em}
|
||||
{\Large Manos KATSOMALLOS\\}
|
||||
\vfill
|
||||
|
||||
{acceptée sur proposition du jury :\\
|
||||
\vspace{1em}
|
||||
Prof. Dimitris Kotzinos, directeur de thèse\\
|
||||
Prof. Katerina Tzompanaki, co-encadrante de thèse\\
|
||||
Dr. *** ***, rapporteur\\
|
||||
Dr. *** ***, examinateur\\}
|
||||
\vfill
|
||||
|
||||
% Bottom of the page
|
||||
\begin{minipage}{\linewidth}
|
||||
\centering
|
||||
\raisebox{-.5\height}{\includegraphics[width=.125\linewidth]{logos/etis}}
|
||||
\qquad
|
||||
\raisebox{-.5\height}{\includegraphics[width=.125\linewidth]{logos/cyu}}
|
||||
\qquad
|
||||
\raisebox{-.5\height}{\includegraphics[width=.125\linewidth]{logos/ensea}}
|
||||
\qquad
|
||||
\raisebox{-.5\height}{\includegraphics[width=.125\linewidth]{logos/cnrs}}
|
||||
\end{minipage}
|
||||
\vspace{.5em}
|
||||
\\
|
||||
{France\\
|
||||
****}
|
||||
|
||||
\end{titlepage}
|