\section{Data sets and data publishing}
\label{sec:data}
In this section, we categorize user-generated data sets in terms of their form and their processing and publishing.
\subsection{Data categories}
\label{subsec:data-categories}
% \kat{Again, the title of the thesis is user-generated data, so there should exist also a distinction between user-generated and third party generated data. Hospital data for example, would fall in the third party generated data.}
In this thesis, we are interested in data that contain information about individuals and their actions, as these are highly privacy-sensitive.
A typical category of such data is \emph{user-generated data}, which are the outcome of interactions between users and services, e.g.,~social media and location-based services (LBS).
These interactions result in the generation of \emph{data items}, which are tuples that typically contain a user identifier, a timestamp, and context information (e.g.,~location or activity).
We first classify data based on their
% content \kat{'based on their content' reminds me of health data, trajectories, etc., not if they are aggregated or not. }:
form into:
% \emph{microdata} and \emph{statistical data}.
% \kat{Use full sentences, even in the bullets. }
% \mk{OK}
\begin{itemize}
\item \emph{Microdata} (Figure~\ref{tab:snapshot-micro}) are the data items
% \kat{define data item}
% \mk{OK}
in their raw, usually tabular, form pertaining to individuals.
% or objects \kat{objects?}.
\item \emph{Statistical data} (Figure~\ref{tab:snapshot-statistical}) are the outcome of statistical processes on microdata, e.g.,~average, count, or sum.
\end{itemize}
To accompany and facilitate the descriptions in this chapter, we provide Example~\ref{ex:snapshot} as a running example.
\begin{example}
\label{ex:snapshot}
Users interact with an LBS by issuing queries to retrieve useful location-based information, or simply by reporting their state at various locations.
This user--LBS interaction generates user-related data, organized in a schema with the following attributes: \emph{Name} (the unique identifier of the table), \emph{Age}, \emph{Location}, and \emph{Status} (Figure~\ref{tab:snapshot-micro}).
The `Status' attribute includes information that characterizes the user state or the query itself, and its value varies according to the service functionality.
Subsequently, the generated data are aggregated (by issuing count queries over them) in order to derive useful information about the popularity of the venues during the day (Figure~\ref{tab:snapshot-statistical}).
\includetable{preliminaries/snapshot}
\end{example}
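As a minimal sketch of the aggregation in Example~\ref{ex:snapshot}, and assuming notation that is introduced here only for illustration, let $D$ denote the microdata table and $d.\mathit{Location}$ the value of the `Location' attribute of a data item $d \in D$.
A count query for a venue $l$ then returns
\[
  c_l(D) = \left| \{\, d \in D : d.\mathit{Location} = l \,\} \right|,
\]
so that the statistical data consist of the pairs $(l, c_l(D))$ for every venue $l$ of interest.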
% \kat{I miss the definition of data. You speak of data items, data values, what is the difference to data?}
% \mk{Done above}
An example of microdata is displayed in Figure~\ref{tab:snapshot-micro}, while an example of statistical data is displayed in Figure~\ref{tab:snapshot-statistical}.
Data, in either of these two forms, may have a special property called~\emph{continuity}, i.e.,~their values change and can be observed through time.
% \kat{The way that you define it here reminds temporal data. What is the difference?}
% \mk{It's the same, we talk about time in data, i.e., temporal data. No?}
% \kat{If you say that data may have a special property called continuity, we wonder about the existence of other properties. Be more explicit on why you choose to mention only this property.}
% \mk{OK}
Observing the evolution of the data attribute values over time may offer valuable insight into the underlying population, not only about the past but also about the present and the future.
Depending on the span of the observation, we categorize data into:
% \emph{finite} and \emph{infinite}.
\begin{itemize}
\item \emph{Finite data} are observed during a predefined time interval.
\item \emph{Infinite data} are observed in an uninterrupted fashion.
\end{itemize}
\begin{example}
\label{ex:continuous}
Extending Example~\ref{ex:snapshot},
% \kat{Maybe put these three tables in a Figure instead of a table?}
% \mk{OK}
Figure~\ref{fig:continuous} shows an example of continuous data,
% observation
% \kat{maybe mention explicitly before what is data observation and continuous data observation }
% \mk{Did it above}
by introducing one data table for each consecutive timestamp.
The two data tables over the time span $[t_1, t_2]$ are an example of finite data.
Infinite data are the whole series of data obtained over the period~$[t_1, \infty)$ (infinity is denoted by `\dots').
\includetable{preliminaries/continuous}
\end{example}
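In symbols, and under the assumed (purely illustrative) convention that $D_{t_i}$ denotes the data table observed at timestamp $t_i$, the finite data of Example~\ref{ex:continuous} form the series
\[
  D_{t_1}, D_{t_2},
\]
whereas infinite data form the unbounded series
\[
  D_{t_1}, D_{t_2}, \dots, D_{t_n}, \dots \qquad (n \to \infty).
\]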
% \kat{Why isn't next the presentation of sequential and incremental in bullets, as for the categories before?}
% \mk{Fixed}
We further define two subcategories that are applicable to both finite and infinite data; they are not exhaustive, i.e.,~not every data set belongs to one of them:
% \emph{sequential} and \emph{incremental} data; these two subcategories are not exhaustive, i.e.,~not all data sets belong to the one or the other category.
\begin{itemize}
\item \emph{Sequential data} have variable values that change depending on their previous values.
For example, trajectories are finite sequences of location stamps, as naturally the position at each timestamp is connected to the position at the previous timestamp.
\item \emph{Incremental data} are augmented at each subsequent timestamp with supplementary information.
For example, trajectories can be considered as incremental data when, at each timestamp, we consider all the locations previously visited by an individual, augmented with the individual's current position.
\end{itemize}
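To make the distinction concrete, consider a trajectory with position $p_i$ at timestamp $t_i$.
The symbols $p_i$, $f$, $\varepsilon_i$, and $S_i$ are assumptions introduced only for this sketch.
Viewed as sequential data, each position depends on the preceding one,
\[
  p_i = f(p_{i-1}, \varepsilon_i),
\]
where $\varepsilon_i$ stands for whatever movement occurred between $t_{i-1}$ and $t_i$.
Viewed as incremental data, the record kept at each timestamp is the previous record extended by the current position,
\[
  S_i = (p_1, \dots, p_{i-1}, p_i), \qquad \mbox{i.e., } S_{i-1} \mbox{ followed by } p_i.
\]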
\subsection{Data processing and publishing}
\label{subsec:data-publishing}
We categorize data processing and publishing, based on which entity has access to the raw data, into the following schemes:
% \emph{global} and \emph{local} schemes.
% \kat{what does the implemented scheme refer to?}
% \mk{These are the bullet points... I change it}
\begin{itemize}
\item \emph{Global scheme} (Figure~\ref{fig:scheme-global}) dictates the collection, processing and privacy-protection, and then publishing of the data by a central (trusted) entity, e.g.,~\cite{mcsherry2009privacy, blocki2013differentially, johnson2018towards}.
\item \emph{Local scheme} (Figure~\ref{fig:scheme-local}) requires the storage, processing and privacy-protection of data on the side of data generators before sending them to any intermediate or final entity, e.g.,~\cite{andres2013geo, erlingsson2014rappor, katsomallos2017open}.
\end{itemize}
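The difference between the two schemes can be sketched as follows, under assumed notation: $d_1, \dots, d_m$ are the raw data items of $m$ data generators, $\mathcal{M}$ is some privacy-protection mechanism, and $g$ is an (optional) aggregation performed by the receiving entity.
None of these symbols is taken from the cited works.
\[
  \pmb{o}_{\mathrm{global}} = \mathcal{M}(d_1, \dots, d_m),
  \qquad
  \pmb{o}_{\mathrm{local}} = g\bigl(\mathcal{M}(d_1), \dots, \mathcal{M}(d_m)\bigr).
\]
In the global scheme the central entity observes the raw items before applying $\mathcal{M}$, whereas in the local scheme it only ever observes the protected items $\mathcal{M}(d_i)$.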
\begin{figure}[htp]
\centering
\subcaptionbox{Global scheme\label{fig:scheme-global}}{%
\includegraphics[width=\linewidth]{preliminaries/scheme-global}%
} \\ \bigskip
\subcaptionbox{Local scheme\label{fig:scheme-local}}{%
\includegraphics[width=\linewidth]{preliminaries/scheme-local}%
}
\caption{The usual flow of user-generated data, optionally harvested by data publishers, privacy-protected, and released to data consumers, according to the (a)~global, and (b)~local privacy schemes.}
\label{fig:privacy-schemes}
\end{figure}
In the case of location data privacy,
% the existing literature\kat{do not say literature, but sth related to the data processing and publishing}
data processing and publishing methods are divided into \emph{service-} and \emph{data-}centric~\cite{chow2011trajectory}.
The service-centric methods correspond to scenarios where individuals share their privacy-protected location with a service to get some relevant information (local publishing scheme).
The data-centric methods relate to the publishing of user-generated data to data consumers (global publishing scheme).
% \kat{I do not get the data-centric methods.. Can't data-centric be also service centric ? E.g., we publish our data to get back some service? Moreover, what is exactly the link between local and global and service and data centric? One to one ?}
% \mk{You've just described service-centric :) }

There is a long-standing debate on whether the local or the global architectural scheme is more efficient with respect to not only privacy, but also organizational, economic, and security factors~\cite{king1983centralized}.
On the one hand, in the global privacy scheme (Figure~\ref{fig:scheme-global}), the dependence on third-party entities poses the risk of arbitrary privacy leakage from a compromised data publisher.
Nonetheless, the expertise of these entities is usually superior to that of the majority of (non-technical) data generators in terms of understanding privacy permissions/\allowbreak policies and setting up relevant preferences.
Moreover, in the global architecture, less distortion is necessary before publicly releasing the aggregated data set, naturally because the data sets are larger and users can be `hidden' more easily.
On the other hand, the local privacy scheme (Figure~\ref{fig:scheme-local}) facilitates fine-grained data management, offering every individual better control over their data~\cite{goldreich1998secure}.
Nonetheless, data distortion at an early stage might prove detrimental to the overall utility of the aggregated data set.
The consensus so far is that neither of the two designs is optimal overall.
Most service-providing companies prefer the global scheme, mainly for reasons of better management and control over the data, while several privacy advocates support the local privacy scheme that offers users full control over what and how data are published.
Although there have been attempts to bridge the gap between them, e.g.,~\cite{bittau2017prochlo}, the global scheme is considerably better explored and implemented~\cite{satyanarayanan2017emergence}.
% For this reason, most of the works in our work span this context.
% \kat{this last sentence is out of context for the thesis dissertation. Please, explain why you said all that, but w.r.t. the thesis.}
% \mk{Omitting it seems to resolve the issue}

We distinguish between the following publishing modes for private data:
% \emph{snapshot} and \emph{continuous}.
% \kat{I do not like that you present some of the categories as bullets and others as plain text. Be consistent in one format.}
% \mk{You're right}
\begin{itemize}
\item \emph{Snapshot mode} (also appearing as \emph{one-shot} or \emph{one-off} publishing) processes and releases a data set at a specific point in time and is not concerned with that data set thereafter.
For example, in Figure~\ref{fig:mode-snapshot} (ignore the privacy-preserving step for the moment) individuals send their data to an LBS provider, considering a specific timestamp.
\item \emph{Continuous mode} computes and publishes augmented or updated versions of one data set at different timestamps, and without a predefined duration.
The use cases of continuous data publishing abound with the proliferation of the Internet, sensors, and connected devices, which produce and send to servers huge amounts of continuous personal data at astounding speed.
In the context of privacy-preserving data publishing, privacy preservation is tightly coupled with the data processing and publishing stages.
% \kat{This can be the introductory sentence of the sub-section, but does not fit here.}
% \kat{but so far you have already presented other categories for processing and publishing; why do you say here two main modes?}
% \mk{Merged it with continuous}
We further categorize the continuous publishing mode into:
% \emph{batch} and \emph{streaming}.
\begin{itemize}
\item \emph{Batch mode} (Figure~\ref{fig:mode-batch}) considers data in groups at specific time intervals.
It is performed (usually offline) over both finite and infinite data.
\item \emph{Streaming mode} (Figure~\ref{fig:mode-streaming}) processes data per timestamp, indefinitely.
It is by definition connected to infinite data and usually operates in real time.
\end{itemize}
\end{itemize}
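Using the notation of Figure~\ref{fig:privacy-modes}, where $\pmb{o}_x$ is the privacy-protected version of the data set $D_x$, the modes can be contrasted by when the outputs are released.
The following is only an illustrative sketch; the mechanism symbol $\mathcal{M}$ and the indexing by timestamps $1, \dots, n$ are assumptions, and whether a streaming output depends on earlier data sets is left open.
\[
\begin{array}{ll}
  \mbox{Snapshot:}  & \pmb{o}_1 = \mathcal{M}(D_1), \mbox{ released once;} \\
  \mbox{Batch:}     & (\pmb{o}_1, \dots, \pmb{o}_n) = \mathcal{M}(D_1, \dots, D_n), \mbox{ released per group;} \\
  \mbox{Streaming:} & \pmb{o}_i = \mathcal{M}(D_1, \dots, D_i), \mbox{ released at every timestamp } i.
\end{array}
\]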
% As already discussed in Section~\ref{ch:intro}, in this thesis we are studying the continuous data publishing mode, and thus we do not include works considering the snapshot paradigm.
% We have made this choice because privacy-preserving continuous data publishing is a more complex problem, receiving more and more attention from the scientific community in the recent years, as shown by the increasing number of publications in this area.
% \kat{this was a good argumentation but for the survey, not for the thesis..}
% \mk{Removed}
\begin{figure}[htp]
\centering
\subcaptionbox{Snapshot mode\label{fig:mode-snapshot}}{%
\includegraphics[width=.49\linewidth]{preliminaries/mode-snapshot}%
} \\ \bigskip\hfill
\subcaptionbox{Batch mode\label{fig:mode-batch}}{%
\includegraphics[width=.49\linewidth]{preliminaries/mode-batch}%
}\hfill
\subcaptionbox{Streaming mode\label{fig:mode-streaming}}{%
\includegraphics[width=.49\linewidth]{preliminaries/mode-streaming}%
}\hfill
\caption{
The different data processing and publishing modes of continuously generated data sets.
(a)~Snapshot publishing, (b)~continuous publishing--batch mode, and (c)~continuous publishing--streaming mode.
$\pmb{o}_x$ denotes the privacy-protected version of the data set $D_x$ or statistics thereof, while `\dots' denotes the continuous data generation and/or publishing, where applicable.
Depending on the data observation span, $n$ can either be finite or tend to infinity.
% \kat{We cannot see in these scenarios the continuous querying of the same snapshot.}
% \mk{Does this appear anywhere in the literature?}
}
\label{fig:privacy-modes}
\end{figure}