\section{Data}
\label{sec:data}
\subsection{Categories}
\label{subsec:data-categories}
The data that we are interested in contain information about individuals and their actions.
We first classify the data based on their content:
\begin{itemize}
\item \emph{Microdata}---the data items in their raw, usually tabular, form pertaining to individuals or objects.
\item \emph{Statistical data}---the outcome of statistical processes on microdata.
\end{itemize}
An example of microdata is displayed in Table~\ref{tab:snapshot-micro}, while an example of statistical data is displayed in Table~\ref{tab:snapshot-statistical}.
Data, in either of these two forms, may have a special property called~\emph{continuity}, i.e.,~their values change and can be observed through time.
Depending on the span of observation, we distinguish the following categories:
\begin{itemize}
\item \emph{Finite data}---data are observed during a predefined time interval.
\item \emph{Infinite data}---data are observed in an uninterrupted fashion.
\end{itemize}
\begin{example}
\label{ex:continuous}
Extending Example~\ref{ex:snapshot}, Table~\ref{tab:continuous} shows an example of continuous data observation, by introducing one data table for each consecutive timestamp.
The two data tables over the time-span $[t_1, t_2]$ are an example of finite data.
Infinite data are the whole series of data obtained over the period~$[t_1, \infty)$ (infinity is denoted by `\dots').
\includetable{continuous}
\end{example}
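The distinction can be made concrete with the following notational sketch, where $D_t$ (the data table observed at timestamp~$t$) is an assumed symbol introduced only for illustration:
\[
  \mbox{finite: } (D_t)_{t \in \{t_1, \dots, t_n\}}, \qquad
  \mbox{infinite: } (D_t)_{t \in \{t_1, t_2, \dots\}}.
\]
Under this reading, the two tables of Example~\ref{ex:continuous} correspond to the finite case with $n = 2$.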
We further define two sub-categories applicable to both finite and infinite data: \emph{sequential} and \emph{incremental} data; these two sub-categories are not exhaustive, i.e.,~not all data sets belong to one or the other category.
In sequential data, the value of the observed variable changes depending on its previous value.
For example, trajectories are finite sequences of location stamps, as naturally the position at each timestamp is connected to the position at the previous timestamp.
In incremental data, an original data set is augmented in each subsequent timestamp with supplementary information.
For example, trajectories can be considered as incremental data when, at each timestamp, we consider all the locations previously visited by an individual, incremented by their current position.
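A minimal formal sketch of the two sub-categories follows, under assumed notation that is not part of the definitions above: $x_t$ is the value observed at timestamp~$t$, $f$ is a hypothetical transition function, $\varepsilon_t$ stands for any external influence, and $\Delta_t$ is the information added at~$t$.
\[
  \mbox{sequential: } x_t = f(x_{t-1}, \varepsilon_t), \qquad
  \mbox{incremental: } D_t = D_{t-1} \cup \Delta_t.
\]
For a trajectory, $x_t$ would be the position at timestamp~$t$, whereas $D_t$ would be the set of all locations visited up to and including~$t$, so that $\Delta_t = \{x_t\}$.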
\subsection{Processing and publishing}
\label{subsec:data-publishing}
We categorize data processing and publishing based on the implemented scheme, as:
\begin{itemize}
\item \emph{Global}---data are collected, processed and privacy-protected, and then published by a central (trusted) entity, e.g.,~\cite{mcsherry2009privacy, blocki2013differentially, johnson2018towards}.
\item \emph{Local}---data are stored, processed and privacy-protected on the side of data generators before sending them to any intermediate or final entity, e.g.,~\cite{andres2013geo, erlingsson2014rappor, katsomallos2017open}.
\end{itemize}
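The difference between the two schemes can be sketched as follows, assuming $n$ data generators with raw data $d_1, \dots, d_n$, a hypothetical privacy-preserving function~$\mathcal{M}$, and a hypothetical aggregation function~$g$ (none of these symbols is fixed by the definitions above):
\[
  \mbox{global: } \pmb{o} = \mathcal{M}\bigl(g(d_1, \dots, d_n)\bigr), \qquad
  \mbox{local: } \pmb{o} = g\bigl(\mathcal{M}(d_1), \dots, \mathcal{M}(d_n)\bigr).
\]
In this simplified view, the central entity of the global scheme accesses the raw data before protecting them, whereas in the local scheme only privacy-protected data leave each data generator.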
\begin{figure}[htp]
\centering
\subcaptionbox{Global scheme\label{fig:scheme-global}}{%
\includegraphics[width=\linewidth]{scheme-global}%
} \\ \bigskip
\subcaptionbox{Local scheme\label{fig:scheme-local}}{%
\includegraphics[width=\linewidth]{scheme-local}%
}
\caption{The usual flow of user-generated data, optionally harvested by data publishers, privacy-protected, and released to data consumers, according to the (a)~global, and (b)~local privacy schemes.}
\label{fig:privacy-schemes}
\end{figure}
In the case of location data privacy, the existing literature is divided into
\emph{service-} and \emph{data-}centric methods~\cite{chow2011trajectory}.
The service-centric methods correspond to scenarios where individuals share their privacy-protected location with a service to get some relevant information (local publishing scheme).
The data-centric methods relate to the publishing of user-generated data to data consumers (global publishing scheme).
There is a long-standing debate whether the local or the global architectural scheme is more efficient with respect to not only privacy, but also organizational, economic, and security factors~\cite{king1983centralized}.
On the one hand, in the global privacy scheme (Figure~\ref{fig:scheme-global}), the dependence on third-party entities poses the risk of arbitrary privacy leakage from a compromised data publisher.
Nonetheless, the expertise of these entities is usually superior to that of the majority of (non-technical) data generators in terms of understanding privacy permissions/\allowbreak policies and setting up relevant preferences.
Moreover, in the global architecture, less distortion is necessary before publicly releasing the aggregated data set, naturally because the data sets are larger and users can be `hidden' more easily.
On the other hand, the local privacy scheme (Figure~\ref{fig:scheme-local}) facilitates fine-grained data management, offering to every individual better control over their data~\cite{goldreich1998secure}.
Nonetheless, data distortion at an early stage might prove detrimental to the overall utility of the aggregated data set.
The consensus so far is that neither of the two designs is optimal overall.
Most service-providing companies prefer the global scheme, mainly for reasons of better management and control over the data, while several privacy advocates support the local privacy scheme that offers users full control over what and how data are published.
Although there have been attempts to bridge the gap between them, e.g.,~\cite{bittau2017prochlo}, the global scheme is considerably better explored and implemented~\cite{satyanarayanan2017emergence}.
For this reason, most of the works that we consider fall within this context.
We distinguish between two publishing modes for private data: \emph{snapshot} and \emph{continuous}.
In snapshot publishing (also appearing as \emph{one-shot} or \emph{one-off} publishing), the system processes and releases a data set at a specific point in time and is thereafter no longer concerned with that data set.
For example, in Figure~\ref{fig:mode-snapshot} (ignore the privacy-preserving step for the moment) individuals send their data to an LBS provider, considering a specific time point.
In continuous data publishing, the system computes and publishes augmented or updated versions of one data set at different time points, without a predefined duration.
In the context of privacy-preserving data publishing, privacy preservation is tightly coupled with the data processing and publishing stages.
As already discussed in Section~\ref{ch:intro}, in this work we study the continuous data publishing mode, and thus we do not include works considering the snapshot paradigm.
We make this deliberate choice because privacy-preserving continuous data publishing is a more complex problem that has received increasing attention from the scientific community in recent years, as shown by the growing number of publications in this area.
Moreover, use cases of continuous data publishing abound with the proliferation of the Internet, sensors, and connected devices, which produce and send to servers huge amounts of continuous personal data at astounding speed.
We identify two main data processing and publishing modes:
\begin{itemize}
\item \emph{Batch}---data are considered in groups in specific time intervals.
\item \emph{Streaming}---data are considered per timestamp, infinitely.
\end{itemize}
\begin{figure}[htp]
\centering
\subcaptionbox{Snapshot mode\label{fig:mode-snapshot}}{%
\includegraphics[width=.4\linewidth]{mode-snapshot}%
} \\ \bigskip\hspace{\fill}
\subcaptionbox{Batch mode\label{fig:mode-batch}}{%
\includegraphics[width=.4\linewidth]{mode-batch}%
}\hspace{\fill}
\subcaptionbox{Streaming mode\label{fig:mode-streaming}}{%
\includegraphics[width=.4\linewidth]{mode-streaming}%
}\hspace{\fill}
\caption{The different data processing and publishing modes of continuously generated data sets.
(a)~Snapshot publishing, (b)~continuous publishing--batch mode, and (c)~continuous publishing--streaming mode.
$\pmb{o}_x$ denotes the privacy-protected version of the data set $D_x$ or statistics thereof, while `\dots' denotes the continuous data generation and/or publishing, where applicable.
Depending on the data observation span, $n$ can either be finite or tend to infinity.}
\label{fig:privacy-modes}
\end{figure}
Batch data processing and publishing (Figure~\ref{fig:mode-batch}) is performed, usually offline, over both finite and infinite data, while streaming processing and publishing (Figure~\ref{fig:mode-streaming}) is performed, usually in real time, and is by definition connected to infinite data.
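As one possible reading of the two modes, reusing the symbols of Figure~\ref{fig:privacy-modes} and assuming a hypothetical privacy-preserving function~$\mathcal{M}$ and a batch length~$w$ (both introduced here only for illustration), we can sketch them in their simplest form as:
\[
  \mbox{batch: } \pmb{o}_k = \mathcal{M}(D_{(k-1)w+1}, \dots, D_{kw}), \qquad
  \mbox{streaming: } \pmb{o}_t = \mathcal{M}(D_t),
\]
i.e.,~a batch release covers a group of $w$ consecutive data sets, while a streaming release is produced at every timestamp~$t$.
In practice, $\mathcal{M}$ may additionally depend on earlier data sets or releases; the sketch omits this for brevity.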