diff --git a/background.tex b/background.tex
index 82562f3..68454c8 100644
--- a/background.tex
+++ b/background.tex
@@ -1,7 +1,584 @@
 \chapter{Background}
 \label{ch:bg}
-This is the background.
+\section{Background}
+\label{sec:background}
+
+In this section, we introduce relevant terminology and background knowledge around the problem of continuous publishing of sensitive data sets.
+First, we categorize data as we view them in the context of continuous data publishing.
+Second, we define data privacy, list the kinds of attacks that have been identified in the literature as well as the desired privacy levels that can be achieved, and present the basic privacy operations that are applied to achieve data privacy.
+Third, we provide a brief overview of the seminal works on privacy-preserving data publishing, which are also used in continuous data publishing, are fundamental in the domain, and are important for the understanding of the rest of the survey.
+
+To accompany and facilitate the descriptions in this section, we provide the following running example.
+
+\begin{example}
+  \label{ex:snapshot}
+  Users interact with an LBS by making queries in order to retrieve some useful location-based information, or simply by reporting their state at various locations.
+  This user--LBS interaction generates user-related data, organized in a schema with the following attributes: \emph{Name} (the unique identifier of the table), \emph{Age}, \emph{Location}, and \emph{Status} (Table~\ref{tab:snapshot-micro}).
+  The `Status' attribute includes information that characterizes the user's state or the query itself, and its value varies according to the service functionality.
+  Subsequently, the generated data are aggregated (by issuing count queries over them) in order to derive useful information about the popularity of the venues during the day (Table~\ref{tab:snapshot-statistical}).
+
+  \begin{table}
+    \centering\hspace{\fill}
+    \subcaptionbox{Microdata\label{tab:snapshot-micro}}{%
+      \begin{tabular}{@{}lrll@{}}
+        \toprule
+        \textit{Name} & \multicolumn{1}{c}{Age} & Location & Status \\
+        \midrule
+        Donald & $27$ & Le Marais & at work \\
+        Daisy & $25$ & Belleville & driving \\
+        Huey & $12$ & Montmartre & running \\
+        Dewey & $11$ & Montmartre & at home \\
+        Louie & $10$ & Latin Quarter & walking \\
+        Quackmore & $62$ & Opera & dining \\
+        \bottomrule
+      \end{tabular}%
+    }\hspace{\fill}
+    \subcaptionbox{Statistical data\label{tab:snapshot-statistical}}{%
+      \begin{tabular}{@{}lr@{}}
+        \toprule
+        Location & \multicolumn{1}{c@{}}{Count} \\
+        \midrule
+        Belleville & $1$ \\
+        Latin Quarter & $1$ \\
+        Le Marais & $1$ \\
+        Montmartre & $2$ \\
+        Opera & $1$ \\
+        \bottomrule
+        \\
+      \end{tabular}%
+    }\hspace{\fill}
+    \caption{Example of raw user-generated (a)~microdata, and related (b)~statistical data for a specific timestamp.}
+    \label{tab:snapshot}
+  \end{table}
+\end{example}
+
+
+\subsection{Data}
+\label{subsec:data}
+
+
+\subsubsection{Categories}
+\label{subsec:data-categories}
+
+As this survey is about privacy, the data that we are interested in contain information about individuals and their actions.
+We first classify the data based on their content:
+
+\begin{itemize}
+  \item \emph{Microdata}---the data items in their raw, usually tabular, form pertaining to individuals or objects.
+  \item \emph{Statistical data}---the outcome of statistical processes on microdata.
+\end{itemize}
+
+An example of microdata is displayed in Table~\ref{tab:snapshot-micro}, while an example of statistical data in Table~\ref{tab:snapshot-statistical}.
+Data, in either of these two forms, may have a special property called~\emph{continuity}, i.e.,~their values change and can be observed through time.
+Depending on the span of observation, we distinguish the following categories:
+
+\begin{itemize}
+  \item \emph{Finite data}---data are observed during a predefined time interval.
+  \item \emph{Infinite data}---data are observed in an uninterrupted fashion.
+\end{itemize}
+
+\begin{example}
+  \label{ex:continuous}
+  Extending Example~\ref{ex:snapshot}, Table~\ref{tab:continuous} shows an example of continuous data observation, by introducing one data table for each consecutive timestamp.
+  The two data tables, over the time-span $[t_1, t_2]$, are an example of finite data.
+  Infinite data are the whole series of data obtained over the period~$[t_1, \infty)$ (infinity is denoted by `\dots').
+
+  \begin{table}
+    \centering
+    \subcaptionbox{Microdata\label{tab:continuous-micro}}{%
+      \adjustbox{max width=\linewidth}{%
+        \begin{tabular}{@{}ccc@{}}
+          \begin{tabular}{@{}lrll@{}}
+            \toprule
+            \textit{Name} & \multicolumn{1}{c}{Age} & Location & Status \\
+            \midrule
+            Donald & $27$ & Le Marais & at work \\
+            Daisy & $25$ & Belleville & driving \\
+            Huey & $12$ & Montmartre & running \\
+            Dewey & $11$ & Montmartre & at home \\
+            Louie & $10$ & Latin Quarter & walking \\
+            Quackmore & $62$ & Opera & dining \\
+            \bottomrule
+            \multicolumn{4}{c}{$t_1$} \\
+          \end{tabular} &
+          \begin{tabular}{@{}lrll@{}}
+            \toprule
+            \textit{Name} & \multicolumn{1}{c}{Age} & Location & Status \\
+            \midrule
+            Donald & $27$ & Montmartre & driving \\
+            Daisy & $25$ & Montmartre & at the mall \\
+            Huey & $12$ & Latin Quarter & sightseeing \\
+            Dewey & $11$ & Opera & walking \\
+            Louie & $10$ & Latin Quarter & at home \\
+            Quackmore & $62$ & Montmartre & biking \\
+            \bottomrule
+            \multicolumn{4}{c}{$t_2$} \\
+          \end{tabular} &
+          \dots
+        \end{tabular}%
+      }%
+    } \\ \bigskip
+    \subcaptionbox{Statistical data\label{tab:continuous-statistical}}{%
+      \begin{tabular}{@{}lrrr@{}}
+        \toprule
+        \multirow{2}{*}{Location} & \multicolumn{3}{c@{}}{Count}\\
+        & \multicolumn{1}{c}{$t_1$} & \multicolumn{1}{c}{$t_2$} & \dots \\
+        \midrule
+        Belleville & $1$ & $0$ & \dots \\
+        Latin Quarter & $1$ & $2$ & \dots \\
+        Le Marais & $1$ & $0$ & \dots \\
+        Montmartre & $2$ & $3$ & \dots \\
+        Opera & $1$ & $1$ & \dots \\
+        \bottomrule
+      \end{tabular}%
+    }%
+    \caption{Continuous data observation of (a)~microdata, and corresponding (b)~statistics at multiple timestamps.}
+    \label{tab:continuous}
+  \end{table}
+\end{example}
+
+We further define two sub-categories applicable to both finite and infinite data: \emph{sequential} and \emph{incremental} data; these two sub-categories are not exhaustive, i.e.,~not all data sets belong to one or the other category.
+In sequential data, the value of the observed variable changes depending on its previous value.
+For example, trajectories are finite sequences of location stamps, as the position at each timestamp naturally depends on the position at the previous timestamp.
+In incremental data, an original data set is augmented at each subsequent timestamp with supplementary information.
+For example, trajectories can be considered as incremental data when, at each timestamp, we consider all the locations previously visited by an individual, augmented with their current position.
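+To make the distinction concrete, the following Python sketch (an illustrative toy, not drawn from any cited work) renders one individual's trajectory first as sequential data and then as incremental data.
+
+\begin{verbatim}
+# A hypothetical trajectory of one individual over three timestamps.
+trajectory = ["Le Marais", "Montmartre", "Opera"]
+
+# Sequential view: one location per timestamp, where each value
+# naturally depends on the previous one (the individual moves in space).
+for t, location in enumerate(trajectory, start=1):
+    print(f"t{t}: {location}")
+
+# Incremental view: at each timestamp, all previously visited
+# locations augmented with the current position.
+for t in range(1, len(trajectory) + 1):
+    print(f"t{t}: {trajectory[:t]}")
+\end{verbatim}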
+
+
+\subsubsection{Processing and publishing}
+\label{subsec:data-publishing}
+
+We categorize data processing and publishing based on the implemented scheme, as:
+
+\begin{itemize}
+  \item \emph{Global}---data are collected, processed and privacy-protected, and then published by a central (trusted) entity, e.g.,~\cite{mcsherry2009privacy, blocki2013differentially, johnson2018towards}.
+  \item \emph{Local}---data are stored, processed and privacy-protected on the side of the data generators before being sent to any intermediate or final entity, e.g.,~\cite{andres2013geo, erlingsson2014rappor, katsomallos2017open}.
+\end{itemize}
+
+\begin{figure}[htp]
+  \centering
+  \subcaptionbox{Global scheme\label{fig:scheme-global}}{%
+    \includegraphics[width=\linewidth]{scheme-global}%
+  } \\ \bigskip
+  \subcaptionbox{Local scheme\label{fig:scheme-local}}{%
+    \includegraphics[width=\linewidth]{scheme-local}%
+  }
+  \caption{The usual flow of user-generated data, optionally harvested by data publishers, privacy-protected, and released to data consumers, according to the (a)~global, and (b)~local privacy schemes.}
+  \label{fig:privacy-schemes}
+\end{figure}
+
+In the case of location data privacy, the existing literature is divided into \emph{service-} and \emph{data-}centric methods~\cite{chow2011trajectory}.
+The service-centric methods correspond to scenarios where individuals share their privacy-protected location with a service to get some relevant information (local publishing scheme).
+The data-centric methods relate to the publishing of user-generated data to data consumers (global publishing scheme).
+
+There is a long-standing debate on whether the local or the global architectural scheme is more efficient with respect to not only privacy, but also organizational, economic, and security factors~\cite{king1983centralized}.
+On the one hand, in the global privacy scheme (Figure~\ref{fig:scheme-global}), the dependence on third-party entities poses the risk of arbitrary privacy leakage from a compromised data publisher.
+Nonetheless, the expertise of these entities is usually superior to that of the majority of (non-technical) data generators in terms of understanding privacy permissions/\allowbreak policies and setting up relevant preferences.
+Moreover, in the global architecture, less distortion is necessary before publicly releasing the aggregated data set, naturally because the data sets are larger and users can be `hidden' more easily.
+On the other hand, the local privacy scheme (Figure~\ref{fig:scheme-local}) facilitates fine-grained data management, offering every individual better control over their data~\cite{goldreich1998secure}.
+Nonetheless, data distortion at an early stage might prove detrimental to the overall utility of the aggregated data set; the sketch below contrasts the two schemes on a toy count aggregation.
+The consensus so far is that neither of the two designs is optimal overall.
+Most service-providing companies prefer the global scheme, mainly for reasons of better management and control over the data, while several privacy advocates support the local privacy scheme, which offers users full control over what data are published and how.
+Although there have been attempts to bridge the gap between them, e.g.,~\cite{bittau2017prochlo}, the global scheme is considerably better explored and implemented~\cite{satyanarayanan2017emergence}.
+For this reason, most of the works in this survey span this context.
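+
+The following Python sketch (a simplified illustration with hypothetical values; the Laplace noise used here as a stand-in for a proper privacy mechanism is detailed in Section~\ref{subsec:privacy-statistical}) contrasts the two schemes on a count query: the global scheme perturbs the aggregate once, whereas the local scheme accumulates one noise term per user.
+
+\begin{verbatim}
+import numpy as np
+
+rng = np.random.default_rng(0)
+values = np.array([1, 0, 1, 1, 0, 1])  # hypothetical per-user binary reports
+epsilon = 1.0                          # privacy parameter
+
+# Global scheme: a trusted curator aggregates first, then adds noise once.
+global_count = values.sum() + rng.laplace(scale=1 / epsilon)
+
+# Local scheme: every user perturbs their own value before sending it,
+# so the published aggregate accumulates one noise term per user.
+local_count = sum(v + rng.laplace(scale=1 / epsilon) for v in values)
+
+print(global_count, local_count)  # the local estimate is typically noisier
+\end{verbatim}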
+
+We distinguish between two publishing modes for private data: \emph{snapshot} and \emph{continuous}.
+In snapshot publishing (also appearing as \emph{one-shot} or \emph{one-off} publishing), the system processes and releases a data set at a specific point in time and is thereafter no longer concerned with that data set.
+For example, in Figure~\ref{fig:mode-snapshot} (ignore the privacy-preserving step for the moment) individuals send their data to an LBS provider at a specific time point.
+In continuous data publishing, the system computes and publishes augmented or updated versions of one data set at different points in time, without a predefined duration.
+In the context of privacy-preserving data publishing, privacy preservation is tightly coupled with the data processing and publishing stages.
+
+As already discussed in Chapter~\ref{ch:intro}, in this survey we study the continuous data publishing mode, and thus we do not include works considering the snapshot paradigm.
+We make this deliberate choice as privacy-preserving continuous data publishing is a more complex problem, receiving more and more attention from the scientific community in recent years, as shown by the increasing number of publications in this area.
+Moreover, the use cases of continuous data publishing abound, with the proliferation of the Internet, sensors, and connected devices, which produce and send to servers huge amounts of continuous personal data at astounding speed.
+
+We identify two main data processing and publishing modes:
+
+\begin{itemize}
+  \item \emph{Batch}---data are considered in groups in specific time intervals.
+  \item \emph{Streaming}---data are considered per timestamp, infinitely.
+\end{itemize}
+
+\begin{figure}[htp]
+  \centering
+  \subcaptionbox{Snapshot mode\label{fig:mode-snapshot}}{%
+    \includegraphics[width=.4\linewidth]{mode-snapshot}%
+  } \\ \bigskip\hspace{\fill}
+  \subcaptionbox{Batch mode\label{fig:mode-batch}}{%
+    \includegraphics[width=.4\linewidth]{mode-batch}%
+  }\hspace{\fill}
+  \subcaptionbox{Streaming mode\label{fig:mode-streaming}}{%
+    \includegraphics[width=.4\linewidth]{mode-streaming}%
+  }\hspace{\fill}
+  \caption{The different data processing and publishing modes of continuously generated data sets.
+  (a)~Snapshot publishing, (b)~continuous publishing--batch mode, and (c)~continuous publishing--streaming mode.
+  $\pmb{o}_x$ denotes the privacy-protected version of the data set $D_x$ or statistics thereof, while `\dots' denote the continuous data generation and/or publishing, where applicable.
+  Depending on the data observation span, $n$ can either be finite or tend to infinity.}
+  \label{fig:privacy-modes}
+\end{figure}
+
+Batch data processing and publishing (Figure~\ref{fig:mode-batch}) is performed (usually offline) over both finite and infinite data, while streaming processing and publishing (Figure~\ref{fig:mode-streaming}) is by definition connected to infinite data (usually in real time).
+
+
+\subsection{Privacy}
+\label{subsec:privacy}
+
+When personal data are publicly released, either as microdata or as statistical data, individuals' privacy can be compromised, i.e.,~an adversary becomes certain about an individual's personal information with a probability higher than a desired threshold.
+In the literature, this compromise is known as \emph{information disclosure} and is usually categorized as~\cite{li2007t, wang2010privacy, narayanan2008robust}:
+
+\begin{itemize}
+  \item \emph{Presence disclosure}---the participation (or absence) of an individual in a data set is revealed.
+  \item \emph{Identity disclosure}---an individual is linked to a particular record.
+  \item \emph{Attribute disclosure}---new information (attribute value) about an individual is revealed.
+\end{itemize}
+
+In the literature, identity disclosure is also referred to as \emph{record linkage}, and presence disclosure as \emph{table linkage}.
+Notice that identity disclosure can result in attribute disclosure, and vice versa.
+
+To better illustrate these definitions, we provide some examples based on Table~\ref{tab:snapshot}.
+Presence disclosure appears when, by looking at the (privacy-protected) counts of Table~\ref{tab:snapshot-statistical}, we can guess whether Quackmore has participated in Table~\ref{tab:snapshot-micro}.
+Identity disclosure appears when we can guess that the sixth record of (a privacy-protected version of) the microdata of Table~\ref{tab:snapshot-micro} belongs to Quackmore.
+Attribute disclosure appears when it is revealed from (a privacy-protected version of) the microdata of Table~\ref{tab:snapshot-micro} that Quackmore is $62$ years old.
+
+
+\subsubsection{Levels}
+\label{subsec:privacy-levels}
+
+The information disclosure that a data release may entail is often linked to the protection level that a privacy-preserving algorithm is trying to achieve.
+More specifically, in continuous data publishing the privacy protection level is considered with respect not only to the users, but also to the \emph{events} occurring in the data.
+An event is a pair of an identifying attribute of an individual and the sensitive data (including contextual information), and can be seen as corresponding to a record in a database in which each individual may participate once.
+Data publishers typically release events in the form of sequences of data points, usually indexed in time order (time series) and geotagged, e.g.,~(`Dewey', `at home at Montmartre at $t_1$'), \dots, (`Quackmore', `dining at Opera at $t_1$').
+The term `users' is used to refer to the \emph{individuals}, also known as \emph{participants}, who are the source of the processed and published data.
+Therefore, they should not be confused with the consumers of the released data sets.
+Users are subject to privacy attacks, and thus are the main point of interest of privacy protection mechanisms.
+In more detail, the privacy protection levels are:
+
+\begin{itemize}
+  \item \emph{Event}~\cite{dwork2010differential, dwork2010pan}---\emph{any single event} of any individual is protected.
+  \item \emph{User}~\cite{dwork2010differential, dwork2010pan}---\emph{all the events} of any individual, spanning the observed event sequence, are protected.
+  \item \emph{$w$-event}~\cite{kellaris2014differentially}---\emph{any sequence of $w$ events}, within the released series of events, of any individual is protected.
+\end{itemize}
+
+Figure~\ref{fig:privacy-levels} demonstrates the application of the possible protection levels on the statistical data of Example~\ref{ex:continuous}.
+For instance, at event-level (Figure~\ref{fig:level-event}) it is hard to determine whether Quackmore was dining at Opera at $t_1$.
+Moreover, at user-level (Figure~\ref{fig:level-user}) it is hard to determine whether Quackmore was ever included in the released series of events at all.
+Finally, at $2$-event-level (Figure~\ref{fig:level-w-event}) it is hard to determine whether Quackmore was ever included in the released series of events between the timestamps $t_1$ and $t_2$, $t_2$ and $t_3$, etc. (i.e.,~for a window $w = 2$).
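+The following Python sketch (illustrative only) enumerates, for a finite series of timestamps, the event scopes that each protection level covers; note how $w$-event-level coincides with event-level for $w = 1$ and with user-level when $w$ equals the length of the series.
+
+\begin{verbatim}
+timestamps = ["t1", "t2", "t3", "t4"]  # a finite observed event series
+
+def protected_scopes(level, w=None):
+    if level == "event":    # every single event on its own
+        return [[t] for t in timestamps]
+    if level == "user":     # all events of an individual together
+        return [timestamps]
+    if level == "w-event":  # any window of w consecutive events
+        return [timestamps[i:i + w]
+                for i in range(len(timestamps) - w + 1)]
+
+print(protected_scopes("event"))         # [['t1'], ['t2'], ['t3'], ['t4']]
+print(protected_scopes("user"))          # [['t1', 't2', 't3', 't4']]
+print(protected_scopes("w-event", w=2))  # [['t1','t2'], ['t2','t3'], ...]
+\end{verbatim}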
+
+\begin{figure}[htp]
+  \centering
+  \hspace{\fill}\subcaptionbox{Event-level\label{fig:level-event}}{%
+    \includegraphics[width=.32\linewidth]{level-event}%
+  }\hspace{\fill}
+  \subcaptionbox{User-level\label{fig:level-user}}{%
+    \includegraphics[width=.32\linewidth]{level-user}%
+  }\hspace{\fill}
+  \subcaptionbox{$2$-event-level\label{fig:level-w-event}}{%
+    \includegraphics[width=.32\linewidth]{level-w-event}%
+  }\hspace{\fill}
+  \caption{Protecting the data of Table~\ref{tab:continuous-statistical} on (a)~event-, (b)~user-, and (c)~$2$-event-level. A suitable distortion method can be applied accordingly.}
+  \label{fig:privacy-levels}
+\end{figure}
+
+Contrary to event-level, which provides privacy guarantees for a single event, user- and $w$-event-level offer stronger privacy protection by protecting a series of events.
+In use cases that involve infinite data, event- and $w$-event-level attain an adequate balance between data utility and user privacy, whereas user-level is more appropriate when the span of data observation is predefined.
+$w$-event-level is narrower than user-level protection, due to its sliding-window processing methodology.
+In the extreme cases where $w$ is set either to $1$ or to the entire length of the series of events, $w$-event-level matches event- or user-level protection, respectively.
+Although the described levels have been coined in the context of \emph{differential privacy}~\cite{dwork2006calibrating}, a seminal privacy method that we will discuss in more detail in Section~\ref{subsec:privacy-statistical}, it is possible to apply their definitions to other privacy protection techniques as well.
+
+
+\subsubsection{Attacks}
+\label{subsec:privacy-attacks}
+
+Information disclosure is typically achieved by combining supplementary (background) knowledge with the released data, or by exploiting unrealistic assumptions made during the design of the privacy-preserving algorithms.
+In its general form, this is known as an \emph{adversarial} or \emph{linkage} attack.
+Even though many works refer directly to the general category of linkage attacks, we also distinguish the following sub-categories, addressed in the literature:
+
+\begin{itemize}
+  \item \emph{Sensitive attribute domain} knowledge.
+  Here we can identify \emph{homogeneity and skewness} attacks~\cite{machanavajjhala2006diversity,li2007t}, when statistics of the sensitive attribute values are available, and the \emph{similarity} attack, when semantics of the sensitive attribute values are available.
+  \item \emph{Complementary release} attacks~\cite{sweeney2002k} with regard to previous releases of different versions of the same and/or related data sets.
+  In this category, we also identify the \emph{unsorted matching} attack~\cite{sweeney2002k}, which is achieved when two privacy-protected versions of an original data set are published in the same tuple ordering.
+  Other instances include: (i)~the \emph{join} attack~\cite{wang2006anonymizing}, when tuples can be identified by joining (on the (quasi-)identifiers) several releases, (ii)~the \emph{tuple correspondence} attack~\cite{fung2008anonymity}, when, in the case of incremental data, tuples in one release correspond injectively to tuples in other releases, (iii)~the \emph{tuple equivalence} attack~\cite{he2011preventing}, when tuples among different releases are found to be equivalent with respect to the sensitive attribute, and (iv)~the \emph{unknown releases} attack~\cite{shmueli2015privacy}, when the privacy preservation is performed without knowledge of the previously privacy-protected data sets.
+  \item \emph{Data dependence}
+  \begin{itemize}
+    \item within one data set.
+    Data tuples and data values within a data set may be correlated, or linked in such a way that information about one person can be inferred even if the person is absent from the database.
+    Consequently, in this category we place attacks that exploit randomness assumptions made on the data generation model, like the random world model, the independent and identically distributed (i.i.d.) data model, or the independent-tuples model, which may be unrealistic for many real-world scenarios.
+    This attack is also known as the \emph{de Finetti attack}~\cite{kifer2009attacks}.
+    \item among one data set and previous data releases, and/or other external sources~\cite{kifer2011no, chen2014correlated, liu2016dependence, zhao2017dependent}.
+    The strength of the dependence between a pair of variables can be quantified with the utilization of \emph{correlations}~\cite{stigler1989francis}.
+    Correlation implies dependence, but not vice versa; nevertheless, the two terms are often used as synonyms.
+    The correlation among nearby observations, i.e.,~the elements in a series of data points, is referred to as \emph{autocorrelation} or \emph{serial correlation}~\cite{park2018fundamentals}.
+    Depending on the evaluation technique, e.g.,~\emph{Pearson's correlation coefficient}~\cite{stigler1989francis}, a correlation can be characterized as \emph{negative}, \emph{zero}, or \emph{positive}.
+    A negative value shows that the behavior of one variable is the \emph{opposite} of that of the other, e.g.,~when one increases, the other decreases.
+    Zero means that the variables are not linked and are \emph{independent} of each other.
+    A positive correlation indicates that the variables behave in a \emph{similar} manner, e.g.,~when one decreases, the other decreases as well.
+
+    The most prominent types of correlation are:
+    \begin{itemize}
+      \item \emph{Temporal}~\cite{wei2006time}---appearing in observations (i.e.,~values) of the same object over time.
+      \item \emph{Spatial}~\cite{legendre1993spatial, anselin1995local}---denoted by the degree of similarity of nearby data points in space, and indicating if and how phenomena relate to the (broader) area where they take place.
+      \item \emph{Spatiotemporal}---a combination of the previous categories, appearing when processing time series or sequences of human activities with geolocation characteristics, e.g.,~\cite{ghinita2009preventing}.
+    \end{itemize}
+    Contrary to one-dimensional correlations, spatial correlation is multi-dimensional and multi-directional, and can be measured by indicators (e.g.,~\emph{Moran's I}~\cite{moran1950notes}) that reflect the \emph{spatial association} of the concerned data.
+    Spatial autocorrelation has its foundations in the \emph{First Law of Geography}, which states that ``everything is related to everything else, but near things are more related than distant things''~\cite{tobler1970computer}.
+    A positive spatial autocorrelation indicates that similar data are \emph{clustered}, a negative one that data are dispersed and close to dissimilar ones, and a value close to zero that data are \emph{randomly arranged} in space.
+  \end{itemize}
+
+  A common practice for extracting data dependencies from continuous data is to express the data as a \emph{stochastic} or \emph{random process}.
+  A random process is a collection of \emph{random variables} or \emph{bivariate data}, indexed by some set, e.g.,~a series of timestamps, a Cartesian plane $\mathbb{R}^2$, an $n$-dimensional Euclidean space, etc.~\cite{skorokhod2005basic}.
+  The values a random variable can take are outcomes of an unpredictable process, while bivariate data are pairs of data values with a possible association between them.
+  Expressing data as stochastic processes allows modeling them according to their properties, and thereafter discovering relevant data dependencies.
+  Some common stochastic process modeling techniques include:
+
+  \begin{itemize}
+    \item \emph{Conditional probabilities}~\cite{allan2013probability}---probabilities of events in the presence of other events.
+    \item \emph{Conditional Random Fields} (CRFs)~\cite{lafferty2001conditional}---undirected graphs encoding conditional probability distributions.
+    \item \emph{Markov processes}~\cite{rogers2000diffusions}---stochastic processes for which the conditional probability of their future states depends only on the present state and is independent of the previous states (\emph{Markov assumption}).
+    \begin{itemize}
+      \item \emph{Markov chains}~\cite{gagniuc2017markov}---sequences of possible events whose probability depends on the state attained in the previous event.
+      \item \emph{Hidden Markov Models} (HMMs)~\cite{baum1966statistical}---statistical Markov models of Markov processes with unobserved states.
+    \end{itemize}
+  \end{itemize}
+
+\end{itemize}
+
+The first sub-category of attacks has been mainly addressed in works on snapshot microdata publishing, and is still present in continuous publishing; however, algorithms for continuous publishing typically adopt the solutions proposed for the snapshot publishing scheme (see the discussion of $k$-anonymity and $l$-diversity in Section~\ref{subsec:privacy-seminal}).
+This kind of attack is tightly coupled with publishing the (privacy-protected) sensitive attribute value.
+An example is the lack of diversity in the sensitive attribute domain, e.g.,~if all users in the data set of Table~\ref{tab:snapshot-micro} shared the same \emph{running} Status (the sensitive attribute).
+The second and third sub-categories are attacks emerging (mostly) in continuous publishing scenarios.
+Consider again the data set in Table~\ref{tab:snapshot-micro}.
+The complementary release attack means that an adversary can learn more about the individuals (e.g.,~that there are high chances that Donald was at work) by combining the information of two privacy-protected versions of this data set.
+Through the data dependence attack, the status of Donald could be inferred with more certainty, by taking into account the status of Dewey at the same moment and the dependencies between Donald's and Dewey's status, e.g.,~when Dewey is at home, then most probably Donald is at work.
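+To make the data dependence attack concrete, the following Python sketch (with entirely hypothetical probabilities) shows how observing Dewey's status sharpens an adversary's belief about Donald's status via a simple conditional probability.
+
+\begin{verbatim}
+# Hypothetical joint distribution over (Dewey's status, Donald's status),
+# encoding "when Dewey is at home, Donald is usually at work".
+joint = {
+    ("at home", "at work"): 0.40,
+    ("at home", "driving"): 0.10,
+    ("walking", "at work"): 0.15,
+    ("walking", "driving"): 0.35,
+}
+
+# Prior belief that Donald is at work (marginal probability).
+prior = sum(p for (_, donald), p in joint.items() if donald == "at work")
+
+# Posterior after observing Dewey at home:
+# Pr[Donald = at work | Dewey = at home].
+home = {k: p for k, p in joint.items() if k[0] == "at home"}
+posterior = home[("at home", "at work")] / sum(home.values())
+
+print(prior, posterior)  # 0.55 -> 0.8: the dependence leaks information
+\end{verbatim}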
+
+To better protect the privacy of individuals like Donald against such attacks, the data need to be privacy-protected more rigorously than in the absence of these threats.
+
+
+\subsubsection{Operations}
+\label{subsec:privacy-operations}
+
+The protection of private information, known by many names (obfuscation, cloaking, anonymization, etc.), is achieved by applying one of a few basic privacy protection operations on the original data.
+Depending on the intervention that we choose to perform, we identify the following operations:
+
+\begin{itemize}
+  \item \emph{Aggregation}---group together multiple rows of a data set to form a single value.
+  \item \emph{Generalization}---replace an attribute value with a parent value in the attribute taxonomy.
+  Notice that a step of generalization may be followed by a step of \emph{specialization}, to improve the quality of the resulting data set.
+  \item \emph{Suppression}---completely delete certain sensitive values or entire records.
+  \item \emph{Perturbation}---disturb the initial attribute value in a deterministic or probabilistic way.
+  The probabilistic data distortion is referred to as \emph{randomization}.
+\end{itemize}
+
+For example, consider the table schema \emph{User(Name, Age, Location, Status)}.
+If we want to protect the \emph{Age} of the user, by aggregation we may replace it by the average age in her Location; by generalization, we may replace the Age by age intervals; by suppression, we may delete the entire table column corresponding to \emph{Age}; by perturbation, we may increase each age by a predefined percentage of the age; by randomization, we may randomly replace each age by a value drawn from the probability density function of the attribute.
+
+It is worth mentioning that there is a series of algorithms (e.g.,~\cite{benaloh2009patient, kamara2010cryptographic, cao2014privacy}) based on the \emph{cryptography} operation.
+However, the majority of these methods, among the other assumptions that they make, place minimal or even no trust in the entities that handle the personal information.
+Furthermore, the amount and nature of data processing that these techniques require usually burden the overall procedure, deteriorate the utility of the resulting data sets, and restrict their applicability.
+Since our focus is limited to techniques that achieve a satisfying balance between participants' privacy and data utility, this family of techniques will not be discussed further in this survey.
+
+
+\subsubsection{Seminal works}
+\label{subsec:privacy-seminal}
+
+For completeness, in this section we present the seminal works for privacy-preserving data publishing, which, even though originally designed for the snapshot publishing scenario, have paved the way for privacy-preserving continuous publishing, since many works in that area are based on or extend them.
+
+
+\paragraph{Microdata}
+\label{subsec:privacy-micro}
+
+Sweeney coined \emph{$k$-anonymity}~\cite{sweeney2002k}, one of the first established notions in data privacy.
+A released data set features $k$-anonymity protection when the sequence of values for a set of identifying attributes, called the \emph{quasi-identifiers}, is the same for at least $k$ records in the data set.
+Computing the quasi-identifiers in a set of attributes is still a hard problem on its own~\cite{motwani2007efficient}.
+$k$-anonymity is syntactic: it renders an individual indistinguishable from at least $k-1$ other individuals in the same data set.
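+The following Python sketch (a toy illustration of the definition itself, not of any particular anonymization algorithm) generalizes the quasi-identifiers of the running example and verifies that every resulting group contains at least $k$ records.
+
+\begin{verbatim}
+# Records as (Name, Age, Location); Status is omitted for brevity.
+records = [
+    ("Donald", 27, "Le Marais"), ("Daisy", 25, "Belleville"),
+    ("Huey", 12, "Montmartre"), ("Dewey", 11, "Montmartre"),
+    ("Louie", 10, "Latin Quarter"), ("Quackmore", 62, "Opera"),
+]
+
+def generalize(age, location):
+    # Coarsen Age into ranges and lift Location to city level;
+    # Name, the identifier, is suppressed altogether.
+    return ("<=20" if age <= 20 else ">20", "Paris")
+
+groups = {}
+for name, age, location in records:
+    groups.setdefault(generalize(age, location), []).append(name)
+
+k = 3
+print(all(len(g) >= k for g in groups.values()))  # True: 3-anonymous
+\end{verbatim}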
+
+In a follow-up work~\cite{sweeney2002achieving}, the author describes a way to achieve $k$-anonymity for a data set by the suppression or generalization of certain values of the quasi-identifiers.
+Machanavajjhala et al.~\cite{machanavajjhala2006diversity} pointed out that $k$-anonymity is vulnerable to homogeneity and background knowledge attacks.
+Thereby, they proposed \emph{$l$-diversity}, which demands that the values of the sensitive attributes are `well-represented' by $l$ sensitive values in each group.
+Principally, a data set can be $l$-diverse by featuring at least $l$ distinct values for the sensitive field in each group (\emph{distinct} $l$-diversity).
+Other instantiations demand that the entropy of the sensitive attribute's value distribution in each group is greater than or equal to $\log(l)$ (\emph{entropy} $l$-diversity), or that, in each group, the number of appearances of the most frequent sensitive value is less than $c$ times the sum of the counts of the $l$-th through the least frequent sensitive values, for a user-defined constant $c$ (\emph{recursive $(c, l)$-diversity}).
+Later on, Li et al.~\cite{li2007t} indicated that $l$-diversity can be bypassed by skewness and similarity attacks, due to sensitive attributes with a small value range.
+In such cases, \emph{$\theta$-closeness} guarantees that the distribution of a sensitive attribute in a group and the distribution of the same attribute in the whole data set are `similar'.
+This similarity is bounded by a threshold $\theta$.
+A data set features $\theta$-closeness when all of its groups feature $\theta$-closeness.
+
+The main drawback of $k$-anonymity (and its derivatives) is that it is not robust against re-identification attacks on the released data set that exploit external information.
+The problems identified in~\cite{sweeney2002k} appear when attempting to apply $k$-anonymity to continuous data publishing (as we will also see next in Section~\ref{sec:micro}).
+These attacks include multiple $k$-anonymous data set releases with the same record order, subsequent releases of a data set without taking into account previous $k$-anonymous releases, and tuple updates.
+Proposed solutions include rearranging the attributes, setting the whole attribute set of previously released data sets as quasi-identifiers, or releasing data based on previous $k$-anonymous releases.
+
+
+\paragraph{Statistical data}
+\label{subsec:privacy-statistical}
+
+While methods based on $k$-anonymity have been mainly employed for releasing microdata, \emph{differential privacy}~\cite{dwork2006calibrating} has been proposed for releasing high-utility aggregates over microdata while providing semantic privacy guarantees.
+Differential privacy is algorithmic: it ensures that any adversary observing a privacy-protected output, no matter their computational power or auxiliary information, cannot conclude with absolute certainty whether an individual is included in the input data set.
+Moreover, it quantifies and bounds the impact that the addition/removal of the data of a single individual to/from an input data set has on the derived privacy-protected aggregates.
+
+In its formal definition, a \emph{privacy mechanism} $\mathcal{M}$, which outputs a query answer with some injected randomness, satisfies $\varepsilon$-differential privacy for a user-defined privacy budget $\varepsilon$~\cite{mcsherry2009privacy} if, for all pairs of \emph{neighboring} (i.e.,~differing by the data of an individual) data sets $D$ and $D'$, and for every set $O$ of possible outputs of $\mathcal{M}$, it holds that:
+$$\Pr[\mathcal{M}(D) \in O]\leq e^\varepsilon \Pr[\mathcal{M}(D') \in O],$$
+
+\noindent where $\Pr[\cdot]$ denotes the probability of an event.
+As the definition implies, for low values of $\varepsilon$, $\mathcal{M}$ achieves stronger privacy protection, since the probabilities of $D$ and $D'$ being the true world are similar, but the utility of the mechanism's output is reduced, since more randomness is introduced.
+The privacy budget $\varepsilon$ takes a positive (non-zero) value, and is usually set to $0.01$, $0.1$, or, in some cases, $\ln2$ or $\ln3$~\cite{lee2011much}.
+
+A typical mechanism example is the \emph{Laplace mechanism}~\cite{dwork2014algorithmic}, which randomly draws a value from the probability distribution of $\textrm{Laplace}(\mu, b)$, where $\mu$ stands for the location parameter and $b > 0$ for the scale parameter.
+Here, $\mu$ is equal to the original output value of a query function, and $b$ is the sensitivity of the query function divided by $\varepsilon$.
+The Laplace mechanism works for any query function whose range is the set of real numbers.
+A specialization of this mechanism for location data is the \emph{Planar Laplace mechanism}~\cite{andres2013geo}, which is based on a multivariate Laplace distribution.
+For query functions that do not return a real number, e.g.,~`What is the most visited country this year?', or in cases where perturbing the value of the output would completely destroy its utility, e.g.,~`What is the optimal price for this auction?', most works in the literature use the \emph{Exponential mechanism}~\cite{dwork2014algorithmic}.
+This mechanism utilizes a utility function $u$ that maps (input data set $D$, output value $r$) pairs to utility scores, and selects an output value $r$ with probability proportional to $\exp(\frac{\varepsilon u(D, r)}{2\Delta u})$, where $\Delta u$ is the sensitivity of the utility function.
+Another technique for differential privacy mechanisms is \emph{randomized response}~\cite{warner1965randomized}.
+It is a privacy-preserving survey method that introduces probabilistic noise to the collected statistics by randomly instructing respondents to answer truthfully or `Yes' to a sensitive, binary question.
+The technique achieves this randomization by including a random event, e.g.,~the flip of a fair coin.
+The respondents reveal to the interviewers only their answer to the question, and keep the result of the random event (i.e.,~whether the coin came up heads or tails) secret.
+Thereafter, since the interviewers know the probability distribution of the random event, e.g.,~$\frac{1}{2}$ heads and $\frac{1}{2}$ tails, they can roughly eliminate the false responses and estimate the true aggregate result.
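+The following Python sketch (with illustrative parameters) implements the coin-flip variant of randomized response described above, together with the interviewer-side de-biasing of the collected answers.
+
+\begin{verbatim}
+import random
+
+random.seed(42)
+
+def respond(truth: bool) -> bool:
+    # Fair coin: heads -> answer truthfully; tails -> answer 'Yes'.
+    return truth if random.random() < 0.5 else True
+
+# Hypothetical population where 30% would truthfully answer 'Yes'.
+population = [random.random() < 0.3 for _ in range(100_000)]
+answers = [respond(t) for t in population]
+
+# Reported 'Yes' rate is q = 0.5 * p + 0.5, hence p = 2 * q - 1.
+q = sum(answers) / len(answers)
+print(2 * q - 1)  # close to 0.3, without exposing any single answer
+\end{verbatim}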
+
+Differential privacy mechanisms satisfy two composability properties: \emph{sequential} and \emph{parallel}~\cite{mcsherry2009privacy, soria2016big}.
+Due to the sequential composability property, the total privacy level of two independent mechanisms $\mathcal{M}_1$ and $\mathcal{M}_2$ over the same data set, satisfying $\varepsilon_1$- and $\varepsilon_2$-differential privacy respectively, equals $\varepsilon_1 + \varepsilon_2$.
+The parallel composability property dictates that, when the mechanisms $\mathcal{M}_1$ and $\mathcal{M}_2$ are applied over disjoint subsets of the same data set, the overall privacy level is $\max_{i\in\{1,2\}}\varepsilon_i$.
+Every time a data publisher interacts with (any part of) the original data set, some of the available privacy budget must be consumed according to the composability properties.
+This is necessary so as to ensure that there is no further arbitrary privacy loss when the released data sets are acquired by adversaries (or regular users).
+However, \emph{post-processing} the output of a differential privacy mechanism can be done without using any additional privacy budget.
+Naturally, using the same (or different) privacy mechanism(s) multiple times to interact with raw data in combination with already perturbed data implies that the privacy guarantee of the final output will be calculated according to sequential composition.
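+The following Python sketch (a bookkeeping illustration, not a complete mechanism) tracks the total privacy budget consumed under the two composability properties.
+
+\begin{verbatim}
+def sequential_budget(epsilons):
+    # Mechanisms applied to the SAME data compose additively.
+    return sum(epsilons)
+
+def parallel_budget(epsilons):
+    # Mechanisms applied to DISJOINT subsets cost only the worst case.
+    return max(epsilons)
+
+# Two mechanisms with budgets 0.5 and 0.3:
+print(sequential_budget([0.5, 0.3]))  # 0.8 when run on the same data set
+print(parallel_budget([0.5, 0.3]))    # 0.5 when run on disjoint subsets
+\end{verbatim}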
+
+Differential privacy methods are best for low-sensitivity queries such as counts, because the presence/\allowbreak absence of a single record can only change the result slightly.
+However, sum and max queries can be problematic, since a single but very different value could change the output noticeably, making it necessary to add a lot of noise to the query's answer.
+Furthermore, asking a series of queries may allow the disambiguation between possible data sets, making it necessary to add even more noise to the outputs.
+For this reason, after a series of queries exhausts the available privacy budget, the data set has to be discarded.
+To keep the original guarantee across multiple queries that require different/\allowbreak new answers, one must inject noise proportional to the number of executed queries, thus destroying the utility of the output.
+
+A special category of differential privacy-preserving algorithms is that of \emph{pan-private} algorithms~\cite{dwork2010pan}.
+Pan-private algorithms retain their privacy guarantees even when snapshots of their internal state (memory) are accessed during their execution by an external entity, e.g.,~through a subpoena, a security breach, etc.
+There are two intrusion types that a data publisher has to deal with when designing a pan-private mechanism: \emph{single unannounced} and \emph{continual announced} intrusion.
+In the former, the data publisher assumes that the mechanism's state is observed by the external entity one unique time, without the data publisher ever being notified about it.
+In the latter, the external entity gains access to the mechanism's state multiple times, and the publisher is notified after each time.
+The simplest approach to dealing with both cases is to make sure that the data in the memory of the mechanism constantly have the same distribution, i.e.,~that they are differentially private.
+Notice that this must hold throughout the mechanism's lifetime, even before/\allowbreak after it processes any sensitive data point(s).
+
+The notion of differential privacy has highly influenced the research community, resulting in many follow-up publications (\cite{mcsherry2007mechanism, kifer2011no, zhang2017privbayes} to mention a few).
+We distinguish here \emph{Pufferfish}~\cite{kifer2014pufferfish} and \emph{geo-indistinguishability}~\cite{andres2013geo,chatzikokolakis2015geo}.
+\emph{Pufferfish} is a framework that allows experts in an application domain, without necessarily having any particular expertise in privacy, to develop privacy definitions for their data sharing needs.
+To define a privacy mechanism using \emph{Pufferfish}, one has to define a set of potential secrets $\mathcal{X}$, a set of distinct pairs $\mathcal{X}_{pairs}$, and auxiliary information about data evolution scenarios $\mathcal{B}$.
+$\mathcal{X}$ serves as an explicit specification of what we would like to protect, e.g.,~`the record of an individual $x$ is (not) in the data'.
+$\mathcal{X}_{pairs}$ is a subset of $\mathcal{X} \times \mathcal{X}$ that instructs how to protect the potential secrets $\mathcal{X}$, e.g.,~(`$x$ is in the table', `$x$ is not in the table').
+Finally, $\mathcal{B}$ is a set of conservative assumptions about how the data evolved (or were generated) that reflects the adversary's belief about the data, e.g.,~probability distributions, variable correlations, etc.
+When there is independence between all the records in the original data set, then $\varepsilon$-differential privacy and the privacy definition of $\varepsilon$-\emph{Pufferfish}$(\mathcal{X}, \mathcal{X}_{pairs}, \mathcal{B})$ are equivalent.
+\emph{Geo-indistinguishability} is an adaptation of differential privacy for location data in snapshot publishing.
+It is based on $l$-privacy, which offers individuals within an area of radius $r$ a privacy level of $l$.
+More specifically, $l$ is equal to $\varepsilon r$ if any two locations within distance $r$ provide data with similar distributions.
+This similarity depends on $r$ because the closer two locations are, the more likely they are to share the same features.
+Intuitively, the definition implies that if an adversary learns the published location of an individual, the adversary cannot infer the individual's true location, out of all the points in a radius $r$, with a certainty higher than a factor depending on $l$.
+The technique adds random noise drawn from a multivariate Laplace distribution to individuals' locations, while taking into account spatial boundaries and features.
+
+\begin{example}
+  \label{ex:application}
+  To illustrate the usage of the microdata and statistical data techniques for privacy-preserving data publishing, we revisit Example~\ref{ex:continuous}.
+  In this example, users continuously interact with an LBS by reporting their status at various locations.
+  Then, the reported data are collected by the central service, in order to be protected and then published, either as a whole or as statistics thereof.
+  Notice that, in order to showcase the straightforward application of $k$-anonymity and differential privacy, we apply the two methods on each timestamp independently from the previous ones, and do not take into account any additional threats imposed by continuity.
+
+  \begin{table}
+    \centering\noindent\adjustbox{max width=\linewidth} {
+      \begin{tabular}{@{}ccc@{}}
+        \begin{tabular}{@{}lrll@{}}
+          \toprule
+          \textit{Name} & \multicolumn{1}{c}{Age} & Location & Status \\
+          \midrule
+          * & $> 20$ & Paris & at work \\
+          * & $> 20$ & Paris & driving \\
+          * & $> 20$ & Paris & dining \\
+          \midrule
+          * & $\leq 20$ & Paris & running \\
+          * & $\leq 20$ & Paris & at home \\
+          * & $\leq 20$ & Paris & walking \\
+          \bottomrule
+        \end{tabular} &
+        \begin{tabular}{@{}lrll@{}}
+          \toprule
+          \textit{Name} & \multicolumn{1}{c}{Age} & Location & Status \\
+          \midrule
+          * & $> 20$ & Paris & driving \\
+          * & $> 20$ & Paris & at the mall \\
+          * & $> 20$ & Paris & biking \\
+          \midrule
+          * & $\leq 20$ & Paris & sightseeing \\
+          * & $\leq 20$ & Paris & walking \\
+          * & $\leq 20$ & Paris & at home \\
+          \bottomrule
+        \end{tabular} &
+        \dots \\
+        $t_1$ & $t_2$ & \\
+      \end{tabular}%
+    }%
+    \caption{$3$-anonymous event-level protected versions of the microdata in Table~\ref{tab:continuous-micro}.}
+    \label{tab:scenario-micro}
+  \end{table}
+
+  First, we anonymize the data set of Table~\ref{tab:continuous-micro} using $k$-anonymity, with $k = 3$.
+  This means that each user is indistinguishable from at least $2$ other users.
+  Status is the sensitive attribute, i.e.,~the attribute that we wish to protect.
+  We start by suppressing the values of the Name attribute, which is the identifier.
+  The Age and Location attributes are the quasi-identifiers, so we proceed to adequately generalize them.
+  We turn age values into ranges ($\leq 20$, and $> 20$), and generalize location to city level (Paris).
+  Finally, we achieve $3$-anonymity by putting the entries in groups of three, according to the quasi-identifiers.
+  Table~\ref{tab:scenario-micro} depicts the results at each timestamp.
+
+  \begin{table}
+    \centering
+    \subcaptionbox{True counts\label{tab:statistical-true}}{%
+      \begin{tabular}{@{}lr@{}}
+        \toprule
+        Location & \multicolumn{1}{c@{}}{Count} \\
+        \midrule
+        Belleville & $1$ \\
+        Latin Quarter & $1$ \\
+        Le Marais & $1$ \\
+        Montmartre & $2$ \\
+        Opera & $1$ \\
+        \bottomrule
+      \end{tabular}%
+    }\quad
+    \subcaptionbox*{}{%
+      \begin{tabular}{@{}c@{}}
+        \\ \\ \\
+        $\xrightarrow[]{\text{Noise}}$
+        \\ \\ \\
+      \end{tabular}%
+    }\quad
+    \subcaptionbox{Perturbed counts\label{tab:statistical-noisy}}{%
+      \begin{tabular}{@{}lr@{}}
+        \toprule
+        Location & \multicolumn{1}{c@{}}{Count} \\
+        \midrule
+        Belleville & $1$ \\
+        Latin Quarter & $0$ \\
+        Le Marais & $2$ \\
+        Montmartre & $3$ \\
+        Opera & $1$ \\
+        \bottomrule
+      \end{tabular}%
+    }%
+    \caption{(a)~The true counts of Table~\ref{tab:continuous-statistical} at $t_1$, and (b)~their $1$-differentially event-level private version.}
+    \label{tab:scenario-statistical}
+  \end{table}
+
+  Next, we demonstrate differential privacy.
+  We apply an $\varepsilon$-differentially private Laplace mechanism, with $\varepsilon = 1$, taking into account the count query that generated the true counts of Table~\ref{tab:continuous-statistical}.
+  The sensitivity of a count query is $1$, since the addition/removal of a tuple from the data set can change the final result of the query by at most $1$ (tuple).
+  Figure~\ref{fig:laplace} shows what the Laplace distribution for the true count in Montmartre at $t_1$ looks like.
+  Table~\ref{tab:statistical-noisy} shows all the perturbed counts that are going to be released.
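+  The perturbation of Table~\ref{tab:scenario-statistical} can be reproduced with a few lines of Python (a minimal sketch using NumPy's Laplace sampler; the exact noisy values naturally differ from run to run).
+
+\begin{verbatim}
+import numpy as np
+
+rng = np.random.default_rng(1)
+true_counts = {"Belleville": 1, "Latin Quarter": 1, "Le Marais": 1,
+               "Montmartre": 2, "Opera": 1}
+
+epsilon = 1.0      # privacy budget
+sensitivity = 1.0  # a count changes by at most 1 per added/removed tuple
+
+# Laplace mechanism: add noise with scale = sensitivity / epsilon; the
+# rounding and clipping are post-processing steps, and hence free.
+noisy = {loc: max(0, round(c + rng.laplace(scale=sensitivity / epsilon)))
+         for loc, c in true_counts.items()}
+print(noisy)
+\end{verbatim}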
+
+  \begin{figure}[htp]
+    \centering
+    \includegraphics[width=.7\linewidth]{laplace}
+    \caption{A Laplace distribution with location parameter $\mu = 2$ and scale $b = 1$.}
+    \label{fig:laplace}
+  \end{figure}
+
+\end{example}
+
 \section{Summary}
diff --git a/graphics/level-event.pdf b/graphics/level-event.pdf
new file mode 100644
index 0000000..4048e4d
Binary files /dev/null and b/graphics/level-event.pdf differ
diff --git a/graphics/level-user.pdf b/graphics/level-user.pdf
new file mode 100644
index 0000000..de0918e
Binary files /dev/null and b/graphics/level-user.pdf differ
diff --git a/graphics/level-w-event.pdf b/graphics/level-w-event.pdf
new file mode 100644
index 0000000..c20f30d
Binary files /dev/null and b/graphics/level-w-event.pdf differ
diff --git a/graphics/mode-batch.pdf b/graphics/mode-batch.pdf
new file mode 100644
index 0000000..830f9b9
Binary files /dev/null and b/graphics/mode-batch.pdf differ
diff --git a/graphics/mode-snapshot.pdf b/graphics/mode-snapshot.pdf
new file mode 100644
index 0000000..b7c68d2
Binary files /dev/null and b/graphics/mode-snapshot.pdf differ
diff --git a/graphics/mode-streaming.pdf b/graphics/mode-streaming.pdf
new file mode 100644
index 0000000..47fbd7c
Binary files /dev/null and b/graphics/mode-streaming.pdf differ
diff --git a/graphics/scheme-global.pdf b/graphics/scheme-global.pdf
new file mode 100644
index 0000000..3190647
Binary files /dev/null and b/graphics/scheme-global.pdf differ
diff --git a/graphics/scheme-local.pdf b/graphics/scheme-local.pdf
new file mode 100644
index 0000000..b36c0e4
Binary files /dev/null and b/graphics/scheme-local.pdf differ
diff --git a/main.tex b/main.tex
index 57394d8..8792f43 100644
--- a/main.tex
+++ b/main.tex
@@ -23,6 +23,7 @@
 \usepackage[utf8]{inputenc}
 \usepackage{multirow}
 \usepackage{stmaryrd}
+\usepackage{subcaption}
 \usepackage[normalem]{ulem}
 \usepackage[table]{xcolor}
 \usepackage{arydshln}