diff --git a/text/conclusion.tex b/text/conclusion.tex
deleted file mode 100644
index 9e04ecc..0000000
--- a/text/conclusion.tex
+++ /dev/null
@@ -1,10 +0,0 @@
-\chapter{Conclusion and future work}
-\label{ch:con}
-
-
-\section{Thesis summary}
-\label{sec:sum-thesis}
-
-
-\section{Perspectives}
-\label{sec:persp}
diff --git a/text/conclusion/main.tex b/text/conclusion/main.tex
new file mode 100644
index 0000000..3ccfbf9
--- /dev/null
+++ b/text/conclusion/main.tex
@@ -0,0 +1,5 @@
+\chapter{Conclusion and future work}
+\label{ch:con}
+
+\input{conclusion/summary}
+\input{conclusion/perspectives}
diff --git a/text/conclusion/perspectives.tex b/text/conclusion/perspectives.tex
new file mode 100644
index 0000000..6d32264
--- /dev/null
+++ b/text/conclusion/perspectives.tex
@@ -0,0 +1,2 @@
+\section{Perspectives}
+\label{sec:persp}
diff --git a/text/conclusion/summary.tex b/text/conclusion/summary.tex
new file mode 100644
index 0000000..adca7e4
--- /dev/null
+++ b/text/conclusion/summary.tex
@@ -0,0 +1,2 @@
+\section{Thesis summary}
+\label{sec:sum-thesis}
diff --git a/text/introduction/contribution.tex b/text/introduction/contribution.tex
new file mode 100644
index 0000000..507ac8f
--- /dev/null
+++ b/text/introduction/contribution.tex
@@ -0,0 +1,2 @@
+\section{Contribution}
+\label{sec:contr}
diff --git a/text/introduction.tex b/text/introduction/main.tex
similarity index 99%
rename from text/introduction.tex
rename to text/introduction/main.tex
index cc4679c..ec84715 100644
--- a/text/introduction.tex
+++ b/text/introduction/main.tex
@@ -68,10 +68,5 @@ Typically, in such cases, we have a collection of data referring to the same ind
Additionally, in many cases, the privacy-preserving processes should take into account implicit correlations and restrictions that exist, e.g.,~space-imposed collocation or movement restrictions.
Since these data are related to most of the important applications and services that enjoy high utilization rates, privacy-preserving continuous data publishing becomes one of the emblematic problems of our time.
-
-\section{Contributions}
-\label{sec:contr}
-
-
-\section{Structure}
-\label{sec:struct}
+\input{introduction/contribution}
+\input{introduction/structure}
diff --git a/text/introduction/structure.tex b/text/introduction/structure.tex
new file mode 100644
index 0000000..8a54fac
--- /dev/null
+++ b/text/introduction/structure.tex
@@ -0,0 +1,2 @@
+\section{Structure}
+\label{sec:struct}
diff --git a/text/main.tex b/text/main.tex
index b0293a6..142ac56 100644
--- a/text/main.tex
+++ b/text/main.tex
@@ -79,7 +79,6 @@
\input{acknowledgements}
\tableofcontents
-
\listofalgorithms
\listoffigures
\listoftables
@@ -88,11 +87,11 @@
% \nocite{*}
-\input{introduction}
-\input{preliminaries}
-\input{related}
-\input{the-thing}
-\input{conclusion}
+\input{introduction/main}
+\input{preliminaries/main}
+\input{related/main}
+\input{the-thing/main}
+\input{conclusion/main}
\backmatter
diff --git a/text/preliminaries/data.tex b/text/preliminaries/data.tex
new file mode 100644
index 0000000..45d4a05
--- /dev/null
+++ b/text/preliminaries/data.tex
@@ -0,0 +1,165 @@
+\section{Data}
+\label{sec:data}
+
+
+\subsection{Categories}
+\label{subsec:data-categories}
+
+As this survey is about privacy, the data that we are interested in contain information about individuals and their actions.
+We first classify the data based on their content:
+
+\begin{itemize}
+ \item \emph{Microdata}---the data items in their raw, usually tabular, form pertaining to individuals or objects.
+ \item \emph{Statistical data}---the outcome of statistical processes on microdata.
+\end{itemize}
+
+An example of microdata is displayed in Table~\ref{tab:snapshot-micro}, while an example of statistical data in Table~\ref{tab:snapshot-statistical}.
+Data, in either of these two forms, may have a special property called~\emph{continuity}, i.e.,~their values change and can be observed through time.
+Depending on the span of observation, we distinguish the following categories:
+
+\begin{itemize}
+ \item \emph{Finite data}---data are observed during a predefined time interval.
+ \item \emph{Infinite data}---data are observed in an uninterrupted fashion.
+\end{itemize}
+
+\begin{example}
+ \label{ex:continuous}
+ Extending Example~\ref{ex:snapshot}, Table~\ref{tab:continuous} shows an example of continuous data observation, by introducing one data table for each consecutive timestamp.
+ The two data tables, over the time-span $[t_1, t_2]$, are an example of finite data.
+ Infinite data are the whole series of data obtained over the period~$[t_1, \infty)$ (infinity is denoted by `\dots').
+
+ \begin{table}
+  \centering
+  \subcaptionbox{Microdata\label{tab:continuous-micro}}{%
+   \adjustbox{max width=\linewidth}{%
+    \begin{tabular}{@{}ccc@{}}
+     \begin{tabular}{@{}lrll@{}}
+      \toprule
+      \textit{Name} & \multicolumn{1}{c}{Age} & Location & Status \\
+      \midrule
+      Donald & $27$ & Le Marais & at work \\
+      Daisy & $25$ & Belleville & driving \\
+      Huey & $12$ & Montmartre & running \\
+      Dewey & $11$ & Montmartre & at home \\
+      Louie & $10$ & Latin Quarter & walking \\
+      Quackmore & $62$ & Opera & dining \\
+      \bottomrule
+      \multicolumn{4}{c}{$t_1$} \\
+     \end{tabular} &
+     \begin{tabular}{@{}lrll@{}}
+      \toprule
+      \textit{Name} & \multicolumn{1}{c}{Age} & Location & Status \\
+      \midrule
+      Donald & $27$ & Montmartre & driving \\
+      Daisy & $25$ & Montmartre & at the mall \\
+      Huey & $12$ & Latin Quarter & sightseeing \\
+      Dewey & $11$ & Opera & walking \\
+      Louie & $10$ & Latin Quarter & at home \\
+      Quackmore & $62$ & Montmartre & biking \\
+      \bottomrule
+      \multicolumn{4}{c}{$t_2$} \\
+     \end{tabular} &
+     \dots
+    \end{tabular}%
+   }%
+  } \\ \bigskip
+  \subcaptionbox{Statistical data\label{tab:continuous-statistical}}{%
+   \begin{tabular}{@{}lrrr@{}}
+    \toprule
+    \multirow{2}{*}{Location} & \multicolumn{3}{c@{}}{Count}\\
+    & \multicolumn{1}{c}{$t_1$} & \multicolumn{1}{c}{$t_2$} & \dots \\
+    \midrule
+    Belleville & $1$ & $0$ & \dots \\
+    Latin Quarter & $1$ & $2$ & \dots \\
+    Le Marais & $1$ & $0$ & \dots \\
+    Montmartre & $2$ & $3$ & \dots \\
+    Opera & $1$ & $1$ & \dots \\
+    \bottomrule
+   \end{tabular}%
+  }%
+  \caption{Continuous data observation of (a)~microdata, and corresponding (b)~statistics at multiple timestamps.}
+  \label{tab:continuous}
+ \end{table}
+\end{example}
+
+We further define two sub-categories applicable to both finite and infinite data: \emph{sequential} and \emph{incremental} data; these two subcategories are not exhaustive, i.e.,~not all data sets belong to one or the other category.
+In sequential data, the value of the observed variable changes depending on its previous value.
+For example, trajectories are finite sequences of location stamps, as naturally the position at each timestamp is connected to the position at the previous timestamp.
+In incremental data, an original data set is augmented at each subsequent timestamp with supplementary information.
+For example, trajectories can be considered as incremental data, when at each timestamp we consider all the locations previously visited by an individual, augmented with their current position.
+
+
+\subsection{Processing and publishing}
+\label{subsec:data-publishing}
+
+We categorize data processing and publishing based on the implemented scheme, as:
+
+\begin{itemize}
+ \item \emph{Global}---data are collected, processed and privacy-protected, and then published by a central (trusted) entity, e.g.,~\cite{mcsherry2009privacy, blocki2013differentially, johnson2018towards}.
+ \item \emph{Local}---data are stored, processed and privacy-protected on the side of the data generators, before being sent to any intermediate or final entity, e.g.,~\cite{andres2013geo, erlingsson2014rappor, katsomallos2017open}.
+\end{itemize}
+
+\begin{figure}[htp]
+ \centering
+ \subcaptionbox{Global scheme\label{fig:scheme-global}}{%
+  \includegraphics[width=\linewidth]{scheme-global}%
+ } \\ \bigskip
+ \subcaptionbox{Local scheme\label{fig:scheme-local}}{%
+  \includegraphics[width=\linewidth]{scheme-local}%
+ }
+ \caption{The usual flow of user-generated data, optionally harvested by data publishers, privacy-protected, and released to data consumers, according to the (a)~global, and (b)~local privacy schemes.}
+ \label{fig:privacy-schemes}
+\end{figure}
+
+In the case of location data privacy, the existing literature is divided into
+\emph{service-} and \emph{data-}centric methods~\cite{chow2011trajectory}.
+The service-centric methods correspond to scenarios where individuals share their privacy-protected location with a service to get some relevant information (local publishing scheme).
+The data-centric methods relate to the publishing of user-generated data to data consumers (global publishing scheme).
+
+There is a long-standing debate on whether the local or the global architectural scheme is more efficient with respect to not only privacy, but also organizational, economic, and security factors~\cite{king1983centralized}.
+On the one hand, in the global privacy scheme (Figure~\ref{fig:scheme-global}), the dependence on third-party entities poses the risk of arbitrary privacy leakage from a compromised data publisher.
+Nonetheless, the expertise of these entities is usually superior to that of the majority of (non-technical) data generators in terms of understanding privacy permissions/\allowbreak policies and setting up relevant preferences.
+Moreover, in the global architecture, less distortion is necessary before publicly releasing the aggregated data set, naturally because the data sets are larger and users can be `hidden' more easily.
+On the other hand, the local privacy scheme (Figure~\ref{fig:scheme-local}) facilitates fine-grained data management, offering every individual better control over their data~\cite{goldreich1998secure}.
+Nonetheless, data distortion at an early stage might prove detrimental to the overall utility of the aggregated data set.
+The consensus so far is that there is no overall optimal solution between the two designs.
+Most service-providing companies prefer the global scheme, mainly for reasons of better management and control over the data, while several privacy advocates support the local privacy scheme, which offers users full control over what data are published and how.
+Although there have been attempts to bridge the gap between the two, e.g.,~\cite{bittau2017prochlo}, the global scheme is considerably better explored and implemented~\cite{satyanarayanan2017emergence}.
+For this reason, most of the works in this survey fall within this context.
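+
+To make the utility difference between the two schemes more concrete, consider the following minimal Python sketch, which contrasts a global release (a trusted curator perturbs an exact count once) with a local release, where each individual randomizes their own report before sending it, in the spirit of randomized response~\cite{erlingsson2014rappor}.
+The function names and parameters are purely illustrative and do not correspond to any system from the surveyed literature.
+
+\begin{verbatim}
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+def global_release(bits, epsilon):
+    # A trusted curator sees the exact count and perturbs it once.
+    return bits.sum() + rng.laplace(scale=1.0 / epsilon)
+
+def local_release(bits, epsilon):
+    # Each individual flips their own bit before reporting
+    # (randomized response); the noisy sum is then debiased.
+    p = np.exp(epsilon) / (np.exp(epsilon) + 1)  # truthful-report probability
+    truthful = rng.random(len(bits)) < p
+    reports = np.where(truthful, bits, 1 - bits)
+    return (reports.sum() - (1 - p) * len(bits)) / (2 * p - 1)
+
+bits = rng.integers(0, 2, size=1000)  # one sensitive bit per user
+print(bits.sum(), global_release(bits, 1.0), local_release(bits, 1.0))
+\end{verbatim}
+
+On typical runs, the local estimate deviates from the true count by an order of magnitude more than the global one, which illustrates why distorting the data early, on the data generators' side, is costlier in terms of utility.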
+
+We distinguish between two publishing modes for private data: \emph{snapshot} and \emph{continuous}.
+In snapshot publishing (also appearing as \emph{one-shot} or \emph{one-off} publishing), the system processes and releases a data set at a specific point in time and is thereafter no longer concerned with that data set.
+For example, in Figure~\ref{fig:mode-snapshot} (ignore the privacy-preserving step for the moment) individuals send their data to an LBS provider, considering a specific time point.
+In continuous data publishing, the system computes and publishes augmented or updated versions of one data set at different time points, without a predefined duration.
+In the context of privacy-preserving data publishing, privacy preservation is tightly coupled with the data processing and publishing stages.
+
+As already discussed in Chapter~\ref{ch:intro}, in this survey we are studying the continuous data publishing mode, and thus we do not include works considering the snapshot paradigm.
+We make this deliberate choice as privacy-preserving continuous data publishing is a more complex problem, receiving more and more attention from the scientific community in recent years, as shown by the increasing number of publications in this area.
+Moreover, the use cases of continuous data publishing abound, with the proliferation of the Internet, sensors, and connected devices, which produce and send to servers huge amounts of continuous personal data at astounding speed.
+
+We identify two main data processing and publishing modes:
+
+\begin{itemize}
+ \item \emph{Batch}---data are considered in groups at specific time intervals.
+ \item \emph{Streaming}---data are considered per timestamp, infinitely.
+\end{itemize}
+
+\begin{figure}[htp]
+ \centering
+ \subcaptionbox{Snapshot mode\label{fig:mode-snapshot}}{%
+  \includegraphics[width=.4\linewidth]{mode-snapshot}%
+ } \\ \bigskip\hspace{\fill}
+ \subcaptionbox{Batch mode\label{fig:mode-batch}}{%
+  \includegraphics[width=.4\linewidth]{mode-batch}%
+ }\hspace{\fill}
+ \subcaptionbox{Streaming mode\label{fig:mode-streaming}}{%
+  \includegraphics[width=.4\linewidth]{mode-streaming}%
+ }\hspace{\fill}
+ \caption{The different data processing and publishing modes of continuously generated data sets.
+ (a)~Snapshot publishing, (b)~continuous publishing--batch mode, and (c)~continuous publishing--streaming mode.
+ $\pmb{o}_x$ denotes the privacy-protected version of the data set $D_x$ or statistics thereof, while `\dots' denotes the continuous data generation and/or publishing, where applicable.
+ Depending on the data observation span, $n$ can either be finite or tend to infinity.}
+ \label{fig:privacy-modes}
+\end{figure}
+
+Batch data processing and publishing (Figure~\ref{fig:mode-batch}) is performed (usually offline) over both finite and infinite data, while streaming processing and publishing (Figure~\ref{fig:mode-streaming}) is by definition connected to infinite data (usually in real time).
diff --git a/text/preliminaries/main.tex b/text/preliminaries/main.tex
new file mode 100644
index 0000000..a9c25c0
--- /dev/null
+++ b/text/preliminaries/main.tex
@@ -0,0 +1,55 @@
+\chapter{Preliminaries}
+\label{ch:prel}
+
+In this chapter, we introduce some relevant terminology and background knowledge around the problem of continuous publishing of sensitive data sets.
+First, we categorize data as we view them in the context of continuous data publishing.
+Second, we define data privacy, we list the kinds of attacks that have been identified in the literature, as well as the desired privacy levels that can be achieved, and the basic privacy operations that are applied to achieve data privacy. +Third, we provide a brief overview of the seminal works on privacy-preserving data publishing, used also in continuous data publishing, fundamental in the domain and important for the understanding of the rest of the survey. + +To accompany and facilitate the descriptions in this chapter, we provide the following running example. + +\begin{example} + \label{ex:snapshot} + Users interact with an LBS by making queries in order to retrieve some useful location-based information or just reporting user-state at various locations. + This user--LBS interaction generates user-related data, organized in a schema with the following attributes: \emph{Name} (the unique identifier of the table), \emph{Age}, \emph{Location}, and \emph{Status} (Table~\ref{tab:snapshot-micro}). + The `Status' attribute includes information that characterizes the user's state or the query itself, and its value varies according to the service functionality. + Subsequently, the generated data are aggregated (by issuing count queries over them) in order to derive useful information about the popularity of the venues during the day (Table~\ref{tab:snapshot-statistical}). + + \begin{table} + \centering\hspace{\fill} + \subcaptionbox{Microdata\label{tab:snapshot-micro}}{% + \begin{tabular}{@{}lrll@{}} + \toprule + \textit{Name} & \multicolumn{1}{c}{Age} & Location & Status \\ + \midrule + Donald & $27$ & Le Marais & at work \\ + Daisy & $25$ & Belleville & driving \\ + Huey & $12$ & Montmartre & running \\ + Dewey & $11$ & Montmartre & at home \\ + Louie & $10$ & Latin Quarter & walking \\ + Quackmore & $62$ & Opera & dining \\ + \bottomrule + \end{tabular}% + }\hspace{\fill} + \subcaptionbox{Statistical data\label{tab:snapshot-statistical}}{% + \begin{tabular}{@{}lr@{}} + \toprule + Location & \multicolumn{1}{c@{}}{Count} \\ + \midrule + Belleville & $1$ \\ + Latin Quarter & $1$ \\ + Le Marais & $1$ \\ + Montmartre & $2$ \\ + Opera & $1$ \\ + \bottomrule + \\ + \end{tabular}% + }\hspace{\fill} + \caption{Example of raw user-generated (a)~microdata, and related (b)~statistical data for a specific timestamp.} + \label{tab:snapshot} + \end{table} +\end{example} + +\input{preliminaries/data} +\input{preliminaries/privacy} +\input{preliminaries/summary} diff --git a/text/preliminaries.tex b/text/preliminaries/privacy.tex similarity index 71% rename from text/preliminaries.tex rename to text/preliminaries/privacy.tex index 8e63759..0c74765 100644 --- a/text/preliminaries.tex +++ b/text/preliminaries/privacy.tex @@ -1,223 +1,3 @@ -\chapter{Preliminaries} -\label{ch:prel} - -In this chapter, we introduce some relevant terminology and background knowledge around the problem of continuous publishing of sensitive data sets. -First, we categorize data as we view them in the context of continuous data publishing. -Second, we define data privacy, we list the kinds of attacks that have been identified in the literature, as well as the desired privacy levels that can be achieved, and the basic privacy operations that are applied to achieve data privacy. -Third, we provide a brief overview of the seminal works on privacy-preserving data publishing, used also in continuous data publishing, fundamental in the domain and important for the understanding of the rest of the survey. 
- -To accompany and facilitate the descriptions in this chapter, we provide the following running example. - -\begin{example} - \label{ex:snapshot} - Users interact with an LBS by making queries in order to retrieve some useful location-based information or just reporting user-state at various locations. - This user--LBS interaction generates user-related data, organized in a schema with the following attributes: \emph{Name} (the unique identifier of the table), \emph{Age}, \emph{Location}, and \emph{Status} (Table~\ref{tab:snapshot-micro}). - The `Status' attribute includes information that characterizes the user's state or the query itself, and its value varies according to the service functionality. - Subsequently, the generated data are aggregated (by issuing count queries over them) in order to derive useful information about the popularity of the venues during the day (Table~\ref{tab:snapshot-statistical}). - - \begin{table} - \centering\hspace{\fill} - \subcaptionbox{Microdata\label{tab:snapshot-micro}}{% - \begin{tabular}{@{}lrll@{}} - \toprule - \textit{Name} & \multicolumn{1}{c}{Age} & Location & Status \\ - \midrule - Donald & $27$ & Le Marais & at work \\ - Daisy & $25$ & Belleville & driving \\ - Huey & $12$ & Montmartre & running \\ - Dewey & $11$ & Montmartre & at home \\ - Louie & $10$ & Latin Quarter & walking \\ - Quackmore & $62$ & Opera & dining \\ - \bottomrule - \end{tabular}% - }\hspace{\fill} - \subcaptionbox{Statistical data\label{tab:snapshot-statistical}}{% - \begin{tabular}{@{}lr@{}} - \toprule - Location & \multicolumn{1}{c@{}}{Count} \\ - \midrule - Belleville & $1$ \\ - Latin Quarter & $1$ \\ - Le Marais & $1$ \\ - Montmartre & $2$ \\ - Opera & $1$ \\ - \bottomrule - \\ - \end{tabular}% - }\hspace{\fill} - \caption{Example of raw user-generated (a)~microdata, and related (b)~statistical data for a specific timestamp.} - \label{tab:snapshot} - \end{table} -\end{example} - - -\section{Data} -\label{sec:data} - - -\subsection{Categories} -\label{subsec:data-categories} - -As this survey is about privacy, the data that we are interested in, contain information about individuals and their actions. -We firstly classify the data based on their content: - -\begin{itemize} - \item \emph{Microdata}---the data items in their raw, usually tabular, form pertaining to individuals or objects. - \item \emph{Statistical data}---the outcome of statistical processes on microdata. -\end{itemize} - -An example of microdata is displayed in Table~\ref{tab:snapshot-micro}, while an example of statistical data in Table~\ref{tab:snapshot-statistical}. -Data, in either of these two forms, may have a special property called~\emph{continuity}, i.e.,~their values change and can be observed through time. -Depending on the span of observation, we distinguish the following categories: - -\begin{itemize} - \item \emph{Finite data}---data are observed during a predefined time interval. - \item \emph{Infinite data}---data are observed in an uninterrupted fashion. -\end{itemize} - -\begin{example} - \label{ex:continuous} - Extending Example~\ref{ex:snapshot}, Table~\ref{tab:continuous} shows an example of continuous data observation, by introducing one data table for each consecutive timestamp. - The two data tables, over the time-span $[t_1, t_2]$ are an example of finite data. - Infinite data are the whole series of data obtained over the period~$[t_1, \infty)$ (infinity is denoted by `\dots'). 
- - \begin{table} - \centering - \subcaptionbox{Microdata\label{tab:continuous-micro}}{% - \adjustbox{max width=\linewidth}{% - \begin{tabular}{@{}ccc@{}} - \begin{tabular}{@{}lrll@{}} - \toprule - \textit{Name} & \multicolumn{1}{c}{Age} & Location & Status \\ - \midrule - Donald & $27$ & Le Marais & at work \\ - Daisy & $25$ & Belleville & driving \\ - Huey & $12$ & Montmartre & running \\ - Dewey & $11$ & Montmartre & at home \\ - Louie & $10$ & Latin Quarter & walking \\ - Quackmore & $62$ & Opera & dining \\ - \bottomrule - \multicolumn{4}{c}{$t_1$} \\ - \end{tabular} & - \begin{tabular}{@{}lrll@{}} - \toprule - \textit{Name} & \multicolumn{1}{c}{Age} & Location & Status \\ - \midrule - Donald & $27$ & Montmartre & driving \\ - Daisy & $25$ & Montmartre & at the mall \\ - Huey & $12$ & Latin Quarter & sightseeing \\ - Dewey & $11$ & Opera & walking \\ - Louie & $10$ & Latin Quarter & at home \\ - Quackmore & $62$ & Montmartre & biking \\ - \bottomrule - \multicolumn{4}{c}{$t_2$} \\ - \end{tabular} & - \dots - \end{tabular}% - }% - } \\ \bigskip - \subcaptionbox{Statistical data\label{tab:continuous-statistical}}{% - \begin{tabular}{@{}lrrr@{}} - \toprule - \multirow{2}{*}{Location} & \multicolumn{3}{c@{}}{Count}\\ - & \multicolumn{1}{c}{$t_1$} & \multicolumn{1}{c}{$t_2$} & \dots \\ - \midrule - Belleville & $1$ & $0$ & \dots \\ - Latin Quarter & $1$ & $2$ & \dots \\ - Le Marais & $1$ & $0$ & \dots \\ - Montmartre & $2$ & $3$ & \dots \\ - Opera & $1$ & $1$ & \dots \\ - \bottomrule - \end{tabular}% - }% - \caption{Continuous data observation of (a)~microdata, and corresponding (b)~statistics at multiple timestamps.} - \label{tab:continuous} - \end{table} -\end{example} - -We further define two sub-categories applicable to both finite and infinite data: \emph{sequential} and \emph{incremental} data; these two subcategories are not exhaustive, i.e.,~not all data sets belong to the one or the other category. -In sequential data, the value of the observed variable changes, depending on its previous value. -For example, trajectories are finite sequences of location stamps, as naturally the position at each timestamp is connected to the position at the previous timestamp. -In incremental data, an original data set is augmented in each subsequent timestamp with supplementary information. -For example, trajectories can be considered as incremental data, when at each timestamp we consider all the previously visited locations by an individual, incremented by his current position. - - -\subsection{Processing and publishing} -\label{subsec:data-publishing} - -We categorize data processing and publishing based on the implemented scheme, as: - -\begin{itemize} - \item \emph{Global}---data are collected, processed and privacy-protected, and then published by a central (trusted) entity, e.g.,~\cite{mcsherry2009privacy, blocki2013differentially, johnson2018towards}. - \item \emph{Local}---data are stored, processed and privacy-protected on the side of data generators before sending them to any intermediate or final entity, e.g.,~\cite{andres2013geo, erlingsson2014rappor, katsomallos2017open}. 
-\end{itemize} - -\begin{figure}[htp] - \centering - \subcaptionbox{Global scheme\label{fig:scheme-global}}{% - \includegraphics[width=\linewidth]{scheme-global}% - } \\ \bigskip - \subcaptionbox{Local scheme\label{fig:scheme-local}}{% - \includegraphics[width=\linewidth]{scheme-local}% - } - \caption{The usual flow of user-generated data, optionally harvested by data publishers, privacy-protected, and released to data consumers, according to the (a)~global, and (b)~local privacy schemes.} - \label{fig:privacy-schemes} -\end{figure} - -In the case of location data privacy, the existing literature is divided in -\emph{service-} and \emph{data-}centric methods~\cite{chow2011trajectory}. -The service-centric methods correspond to scenarios where individuals share their privacy-protected location with a service to get some relevant information (local publishing scheme). -The data-centric methods relate to the publishing of user-generated data to data consumers (global publishing scheme). - -There is a long-standing debate whether the local or the global architectural scheme is more efficient with respect to not only privacy, but also organizational, economic, and security factors~\cite{king1983centralized}. -On the one hand, in the global privacy scheme (Figure~\ref{fig:scheme-global}), the dependence on third-party entities poses the risk of arbitrary privacy leakage from a compromised data publisher. -Nonetheless, the expertise of these entities is usually superior to that of the majority of (non-technical) data generators' in terms of understanding privacy permissions/\allowbreak policies and setting-up relevant preferences. -Moreover, in the global architecture, less distortion is necessary before publicly releasing the aggregated data set, naturally because the data sets are larger and users can be `hidden' more easily. -On the other hand, the local privacy scheme (Figure~\ref{fig:scheme-local}) facilitates fine-grained data management, offering to every individual better control over their data~\cite{goldreich1998secure}. -Nonetheless, data distortion at an early stage might prove detrimental to the overall utility of the aggregated data set. -The so far consensus is that there is no overall optimal solution among the two designs. -Most service-providing companies prefer the global scheme, mainly for reasons of better management and control over the data, while several privacy advocates support the local privacy scheme that offers users full control over what and how data are published. -Although there have been attempts to bridge the gap between them, e.g.,~\cite{bittau2017prochlo}, the global scheme is considerably better explored and implemented~\cite{satyanarayanan2017emergence}. -For this reason, most of the works in this survey span this context. - -We distinguish between two publishing modes for private data: \emph{snapshot} and \emph{continuous}. -In snapshot publishing (also appearing as \emph{one-shot} or \emph{one-off} publishing), the system processes and releases a data set at a specific point in time and thereafter is not concerned anymore with the specific data set. -For example, in Figure~\ref{fig:mode-snapshot} (ignore the privacy-preserving step for the moment) individuals send their data to an LBS provider, considering a specific time point. -In continuous data publishing the system computes, and publishes augmented or updated versions of one data set in different time points, and without a predefined duration. 
-In the context of privacy-preserving data publishing, privacy preservation is tightly coupled with the data processing and publishing stages. - -As already discussed in Section~\ref{ch:intro}, in this survey we are studying the continuous data publishing mode, and thus we do not include works considering the snapshot paradigm. -We make this deliberate choice as privacy-preserving continuous data publishing is a more complex problem, receiving more and more attention from the scientific community in the recent years, as shown by the increasing number of publications in this area. -Moreover, the use cases of continuous data publishing abound, with the proliferation of the Internet, sensors, and connected devices, which produce and send to servers huge amounts of continuous personal data in astounding speed. - -We identify two main data processing and publishing modes: - -\begin{itemize} - \item \emph{Batch}---data are considered in groups in specific time intervals. - \item \emph{Streaming}---data are considered per timestamp, infinitely. -\end{itemize} - -\begin{figure}[htp] - \centering - \subcaptionbox{Snapshot mode\label{fig:mode-snapshot}}{% - \includegraphics[width=.4\linewidth]{mode-snapshot}% - } \\ \bigskip\hspace{\fill} - \subcaptionbox{Batch mode\label{fig:mode-batch}}{% - \includegraphics[width=.4\linewidth]{mode-batch}% - }\hspace{\fill} - \subcaptionbox{Streaming mode\label{fig:mode-streaming}}{% - \includegraphics[width=.4\linewidth]{mode-streaming}% - }\hspace{\fill} - \caption{The different data processing and publishing modes of continuously generated data sets. - (a)~Snapshot publishing, (b)~continuous publishing--batch mode, and (c)~continuous publishing--streaming mode. - $\pmb{o}_x$ denotes the privacy-protected version of the data set $D_x$ or statistics thereof, while `\dots' denote the continuous data generation and/or publishing, where applicable. - Depending on the data observation span, $n$ can either be finite or tend to infinity.} - \label{fig:privacy-modes} -\end{figure} - -Batch data processing and publishing (Figure~\ref{fig:mode-batch}) is performed (usually offline) over both finite and infinite data, while streaming processing and publishing (Figure~\ref{fig:mode-streaming}) is by definition connected to infinite data (usually in real-time). - - \section{Privacy} \label{sec:privacy} @@ -575,10 +355,3 @@ The technique adds random noise drawn from a multivariate Laplace distribution t \end{figure} \end{example} - - - -\section{Summary} -\label{sec:sum-bg} - -This is the summary of this chapter. diff --git a/text/preliminaries/summary.tex b/text/preliminaries/summary.tex new file mode 100644 index 0000000..53c50e6 --- /dev/null +++ b/text/preliminaries/summary.tex @@ -0,0 +1,4 @@ +\section{Summary} +\label{sec:sum-bg} + +This is the summary of this chapter. diff --git a/text/related.tex b/text/related/main.tex similarity index 95% rename from text/related.tex rename to text/related/main.tex index 99a21a1..0b99503 100644 --- a/text/related.tex +++ b/text/related/main.tex @@ -15,11 +15,6 @@ For example, Zhou et al.~\cite{zhou2008brief} have a focus on social networks, a Nevertheless, to the best of our knowledge, there is no up-to-date survey that deals with privacy under continuous data publishing covering diverse use cases. Such a survey becomes very useful nowadays, due to the abundance of continuously user-generated data sets that could be analyzed and/or published in a privacy-preserving way, and the quick progress made in this research field. 
-\input{micro}
-\input{statistical}
-
-
-\section{Summary}
-\label{sec:sum-rel}
-
-This is the summary of this chapter.
+\input{related/micro}
+\input{related/statistical}
+\input{related/summary}
diff --git a/text/micro.tex b/text/related/micro.tex
similarity index 100%
rename from text/micro.tex
rename to text/related/micro.tex
diff --git a/text/statistical.tex b/text/related/statistical.tex
similarity index 100%
rename from text/statistical.tex
rename to text/related/statistical.tex
diff --git a/text/related/summary.tex b/text/related/summary.tex
new file mode 100644
index 0000000..574bdf3
--- /dev/null
+++ b/text/related/summary.tex
@@ -0,0 +1,4 @@
+\section{Summary}
+\label{sec:sum-rel}
+
+This is the summary of this chapter.
diff --git a/text/the-thing/contribution.tex b/text/the-thing/contribution.tex
new file mode 100644
index 0000000..266ff1c
--- /dev/null
+++ b/text/the-thing/contribution.tex
@@ -0,0 +1,7 @@
+\section{Contribution}
+\label{sec:lmdk-contrib}
+
+In this chapter, we formally define a novel privacy notion that we call \emph{{\thething} privacy}.
+We apply this privacy notion to time series consisting of \emph{{\thethings}} and regular events, and we design and implement three {\thething} privacy mechanisms.
+We further study {\thething} privacy under temporal correlation, which is inherent in time series publishing.
+Finally, we evaluate {\thething} privacy on real and synthetic data sets, in settings with and without temporal correlation, showcasing the validity of our model.
diff --git a/text/the-thing/evaluation.tex b/text/the-thing/evaluation.tex
new file mode 100644
index 0000000..7b82d91
--- /dev/null
+++ b/text/the-thing/evaluation.tex
@@ -0,0 +1,128 @@
+\section{Evaluation}
+\label{sec:the-thing-eval}
+
+In this section, we present the experiments that we performed on real and synthetic data sets.
+With the experiments on the synthetic data sets, we show the privacy loss incurred by our framework when tuning the size and statistical characteristics of the input {\thething} set $L$.
+We also show how the privacy loss under temporal correlation is affected by the number and distribution of the {\thethings}.
+With the experiments on the real data sets, we show the performance, in terms of utility, of our three {\thething} mechanisms.
+
+Notice that, in our experiments, in the cases when $0\%$ and $100\%$ of the events are {\thethings}, we get the same behavior as in event- and user-level privacy, respectively.
+This happens due to the fact that, when there are no {\thethings}, at each timestamp we take into account only the data items of the current timestamp and ignore the rest of the time series (event-level).
+Whereas, when every timestamp corresponds to a {\thething}, we consider and protect all the events throughout the entire series (user-level).
+
+
+\subsection{Setting, configurations, and data sets}
+\paragraph{Setting}
+We implemented our experiments\footnote{Code available at \url{https://gitlab.com/adhesivegoldfinch/cikm}} in Python $3$.$9$.$5$ and executed them on a machine with an Intel i$7$-$6700$HQ $3$.$5$GHz CPU and $16$GB RAM, running Manjaro $21$.$0$.$5$.
+We repeated each experiment $100$ times and we report the mean over these iterations.
+
+
+\paragraph{Data sets}
+For the \emph{real} data sets, we used Geolife~\cite{zheng2010geolife} and T-drive~\cite{yuan2010t}, from which we sampled the first $1000$ data items.
+We achieved the desired {\thethings} percentages by utilizing the method of Li et al.~\cite{li2008mining} for detecting stay points in trajectory data.
+In more detail, the algorithm checks, for each data item, whether the subsequent items lie within a given distance threshold $\Delta l$, and measures the time period $\Delta t$ between the present point and the last such subsequent point.
+We achieve $0$, $20$, $40$, $60$, $80$, and $100$ {\thethings} percentages by setting the ($\Delta l$ in meters, $\Delta t$ in minutes) pairs input to the stay point discovery method for T-drive as [($0$, $1000$), ($2095$, $30$), ($2790$, $30$), ($3590$, $30$), ($4825$, $30$), ($10350$, $30$)] and for Geolife as [($0$, $100000$), ($205$, $30$), ($450$, $30$), ($725$, $30$), ($855$, $30$), ($50000$, $30$)].
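+
+A condensed Python sketch of this stay-point criterion follows; it is our own simplified illustration (planar Euclidean distances, illustrative names), not the code of~\cite{li2008mining}.
+
+\begin{verbatim}
+import math
+
+def stay_points(points, dl, dt):
+    # points: list of (x, y, t) tuples, t in minutes;
+    # dl in meters, dt in minutes.
+    stays, i = [], 0
+    while i < len(points):
+        j = i
+        # Extend the window while subsequent items stay within dl of point i.
+        while (j + 1 < len(points)
+               and math.dist(points[i][:2], points[j + 1][:2]) <= dl):
+            j += 1
+        # The window is a stay point if it spans at least dt minutes.
+        if points[j][2] - points[i][2] >= dt:
+            xs, ys, _ = zip(*points[i:j + 1])
+            stays.append((sum(xs) / len(xs), sum(ys) / len(ys)))
+        i = j + 1
+    return stays
+\end{verbatim}
+
+Intuitively, larger $\Delta l$ values make more of the trajectory qualify as stays, which is how the ($\Delta l$, $\Delta t$) pairs listed above control the resulting {\thethings} percentages.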
+
+
+Next, we generated synthetic time series of length equal to $100$ timestamps, for which we varied the number and distribution of {\thethings}.
+We created \emph{left-skewed} (the {\thethings} are distributed towards the end), \emph{symmetric} (in the middle), \emph{right-skewed} (in the beginning), \emph{bimodal} (both end and beginning), and \emph{uniform} (all over the time series) {\thething} distributions.
+When pertinent, we group the left- and right-skewed cases as simply `skewed', since they share several features due to symmetry.
+In order to get {\thethings} with the above distribution features, we generate probability distributions with the appropriate characteristics and sample from them, without replacement, the desired number of points.
+The generated distributions are representative of the cases that we wish to examine during the experiments.
+For example, for a left-skewed {\thething} distribution, we utilize a truncated distribution resulting from the restriction of the domain of a normal distribution to the extent of the time series, with its location shifted to the center of the right half of the series.
+For consistency, we calculate the scale parameter depending on the length of the series, by setting it equal to the series' length over a constant.
+We take into account only the temporal order of the points and the position of regular and {\thething} events within the series.
+Note that, for the experiments performed on the synthetic data sets, the original values to be released do not influence the outcome of our conclusions, thus we ignore them.
+
+
+\paragraph{Configurations}
+We model the temporal correlation in the synthetic data as a \emph{stochastic matrix} $P$, using a \emph{Markov chain}~\cite{gagniuc2017markov}.
+$P$ is an $n \times n$ matrix, where the element $p_{ij}$ represents the transition probability from a state $i$ to another state $j$; hence, the elements of every row of $P$ sum up to $1$.
+We follow the \emph{Laplacian smoothing} technique~\cite{sorkine2004laplacian}, as utilized in~\cite{cao2018quantifying}, to generate the matrix $P$ with a degree of temporal correlation $s > 0$, calculating each element as
+$$p_{ij} = \frac{(I_{n})_{ij} + s}{\sum_{k = 1}^{n}\left((I_{n})_{ik} + s\right)}$$
+where $I_{n}$ is the identity matrix of size $n$.
+$s$ dictates the strength of the correlation; the lower its value, the stronger the correlation degree, whereas the more uniform the rows of $P$ become, the weaker the correlation.
+In our experiments, for simplicity, we set $n = 2$ and we investigate the effect of \emph{weak} ($s = 1$), \emph{moderate} ($s = 0.1$), and \emph{strong} ($s = 0.01$) temporal correlation degrees on the overall privacy loss.
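+
+The following Python lines make the construction concrete; the function name is illustrative, but the computation follows the formula above.
+
+\begin{verbatim}
+import numpy as np
+
+def gen_matrix(n, s):
+    # Laplacian smoothing of the identity matrix:
+    # p_ij = (I_ij + s) / sum_k(I_ik + s).
+    P = np.eye(n) + s
+    return P / P.sum(axis=1, keepdims=True)
+
+print(gen_matrix(2, 1))     # weak:   approx. [[0.667, 0.333], [0.333, 0.667]]
+print(gen_matrix(2, 0.01))  # strong: approx. [[0.990, 0.010], [0.010, 0.990]]
+\end{verbatim}
+
+As $s$ grows, every row approaches the uniform distribution, i.e.,~the next state depends less and less on the current one, which is why larger values of $s$ correspond to weaker temporal correlation.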
+
+We set $\varepsilon = 1$.
+To perturb the spatial values of the real data sets, we inject noise that we sample from the Planar Laplace mechanism~\cite{andres2013geo}.
+Finally, notice that all diagrams are in logarithmic scale.
+
+\subsection{Experiments}
+
+\paragraph{Budget allocation schemes}
+
+Figure~\ref{fig:real} exhibits the performance of the three mechanisms: Skip, Uniform, and Adaptive.
+
+\begin{figure}[htp]
+ \centering
+ \subcaptionbox{Geolife\label{fig:geolife}}{%
+  \includegraphics[width=.5\linewidth]{geolife}%
+ }%
+ \subcaptionbox{T-drive\label{fig:t-drive}}{%
+  \includegraphics[width=.5\linewidth]{t-drive}%
+ }%
+ \caption{The mean absolute error (in meters) of the released data for different {\thethings} percentages.}
+ \label{fig:real}
+\end{figure}
+
+For the Geolife data set (Figure~\ref{fig:geolife}), Skip has the best performance (measured in mean absolute error, in meters) because it invests the most budget overall at every regular event, approximating the {\thething} data based on previous releases.
+Due to the data set's high density (every $1$--$5$ seconds, or every $5$--$10$ meters per point), constantly approximating has a low impact on the data utility.
+On the contrary, the lower density of the T-drive data set (Figure~\ref{fig:t-drive}) has a negative impact on the performance of Skip.
+In the T-drive data set, the Adaptive mechanism outperforms the Uniform one by $10$\%--$20$\% for all {\thethings} percentages greater than $0$, and the Skip one by more than $20$\%.
+In general, we can claim that Adaptive is the best performing mechanism, especially if we take into consideration the drawbacks of the Skip mechanism mentioned in Section~\ref{subsec:lmdk-mechs}.
+Moreover, designing a data-dependent sampling scheme would likely further improve the performance of Adaptive.
+
+
+\paragraph{Temporal distance and correlation}
+Figure~\ref{fig:avg-dist} shows a comparison of the average temporal distance of the events from the previous/next {\thething} or the start/end of the time series, for various distributions in the synthetic data.
+More precisely, for every event, we count the total number of events between itself and the nearest {\thething} or the series edge; a short sketch of this computation follows.
+We observe that the uniform and bimodal distributions tend to limit the regular event--{\thething} distance.
+This is due to the fact that the former scatters the {\thethings}, while the latter distributes them on both edges, leaving a shorter space uninterrupted by {\thethings}.
+On the contrary, distributing the {\thethings} at one part of the sequence, as in the skewed and symmetric cases, creates a wider space without {\thethings}.
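+
+The following sketch (our own illustration, not the benchmark implementation) spells out the counting rule, treating the series edges as virtual boundaries.
+
+\begin{verbatim}
+def avg_regular_distance(n, landmarks):
+    # Mean number of events between each regular event and the
+    # nearest landmark or series edge, over n timestamps.
+    boundaries = sorted(set(landmarks) | {-1, n})
+    distances = [min(abs(t - b) for b in boundaries) - 1
+                 for t in range(n) if t not in landmarks]
+    return sum(distances) / len(distances) if distances else 0
+
+# 100 timestamps: uniform landmarks vs. a left-skewed cluster at the end.
+print(avg_regular_distance(100, set(range(0, 100, 10))))  # small
+print(avg_regular_distance(100, set(range(90, 100))))     # large
+\end{verbatim}
+
+The two calls reproduce, in miniature, the gap between the uniform and skewed cases of Figure~\ref{fig:avg-dist}.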
+
+\begin{figure}[htp]
+ \centering
+ \includegraphics[width=.5\linewidth]{avg-dist}%
+ \caption{Average temporal distance of the events from the {\thethings} for different {\thethings} percentages within a time series, in various {\thethings} distributions.}
+ \label{fig:avg-dist}
+\end{figure}
+
+Figure~\ref{fig:dist-cor} illustrates a comparison among the aforementioned distributions regarding the overall privacy loss under moderate (Figure~\ref{fig:dist-cor-mod}) and strong (Figure~\ref{fig:dist-cor-stg}) correlation degrees.
+The line shows the overall privacy loss---for all cases of {\thethings} distribution---without temporal correlation.
+We skip the discussion of the results under a weak correlation degree (Figure~\ref{fig:dist-cor-wk}), since all distribution cases converge there.
+In combination with Figure~\ref{fig:avg-dist}, we conclude that a greater average event--{\thething} distance in a distribution can result in greater overall privacy loss under moderate and strong temporal correlation.
+This is due to the fact that the backward/forward privacy loss accumulates more over time in wider spaces without {\thethings} (see Section~\ref{subsec:correlations}).
+Furthermore, the behavior of the privacy loss is as expected regarding the temporal correlation degree.
+Predictably, a stronger correlation degree generates higher privacy loss, while widening the gap between the different distribution cases.
+On the contrary, a weaker correlation degree makes it harder to differentiate among the {\thethings} distributions.
+
+\begin{figure}[htp]
+ \centering
+ \subcaptionbox{Weak correlation\label{fig:dist-cor-wk}}{%
+  \includegraphics[width=.5\linewidth]{dist-cor-wk}%
+ }%
+ \hspace{\fill}
+ \subcaptionbox{Moderate correlation\label{fig:dist-cor-mod}}{%
+  \includegraphics[width=.5\linewidth]{dist-cor-mod}%
+ }%
+ \subcaptionbox{Strong correlation\label{fig:dist-cor-stg}}{%
+  \includegraphics[width=.5\linewidth]{dist-cor-stg}%
+ }%
+ \caption{Privacy loss for different {\thethings} percentages and distributions, under weak, moderate, and strong degrees of temporal correlation.
+ The line shows the overall privacy loss without temporal correlation.}
+ \label{fig:dist-cor}
+\end{figure}
diff --git a/text/the-thing/main.tex b/text/the-thing/main.tex
new file mode 100644
index 0000000..6e067d3
--- /dev/null
+++ b/text/the-thing/main.tex
@@ -0,0 +1,11 @@
+\chapter{Significant events}
+\label{ch:the-thing}
+
+In this chapter, we propose a novel configurable privacy scheme, \emph{\thething} privacy, which takes into account significant events (\emph{\thethings}) in the time series and allocates the available privacy budget accordingly.
+We propose two privacy models that guarantee {\thething} privacy and validate our proposal on real and synthetic data sets.
+
+\input{the-thing/motivation}
+\input{the-thing/contribution}
+\input{the-thing/problem}
+\input{the-thing/evaluation}
+\input{the-thing/summary}
diff --git a/text/the-thing/motivation.tex b/text/the-thing/motivation.tex
new file mode 100644
index 0000000..7a0d588
--- /dev/null
+++ b/text/the-thing/motivation.tex
@@ -0,0 +1,63 @@
+\section{Motivation}
+\label{sec:lmdk-motiv}
+
+The plethora of sensors currently embedded in
+or paired with personal devices and other infrastructures has paved the way for the development of numerous \emph{crowdsensing services} (e.g.,~Google Maps~\cite{gmaps}, Waze~\cite{waze}, etc.) based on the collected personal, and usually geotagged and timestamped, data.
+
+User--service interactions gather personal event-like data, e.g.,~(\emph{`Bob', `dining', `Canal Saint-Martin', $17{:}00$}).
+When the interactions are performed in a continuous manner, we obtain \emph{time series} of events.
+An \emph{event} represents a user--service interaction, registering the information of the individual at a specific time point, i.e.,~a data item that is a pair of an identifying attribute of an individual and the---possibly sensitive---information at a timestamp (including contextual information).
+It corresponds to a record in a database, where each individual may participate once, e.g.,~(`Bob', `dining', `Canal Saint-Martin', $5$).
+Typically, users interact with the services more than once, generating data in a continuous manner (\emph{time series}).
+The services collect and further process the time series in order to give useful feedback to the involved users or to provide valuable insight to various internal/external analytical services.
+Depending on its span, we distinguish the processing into \emph{finite}, when taking place during a predefined time interval, and \emph{infinite}, when taking place in an uninterrupted fashion.
+% Figure~\ref{fig:scenario} shows an example of a finite time series produced by a user (Bob) and composed by $8$ timestamps during his trajectory from his home (\'Elys\'ee) to his work (Louvre) to his hangout (Saint-Martin) and back to his home.
+
+\begin{example}
+ \label{ex:lmdk-scenario}
+
+ Consider a finite sequence of spatiotemporal data generated by Bob during an interval of $8$ timestamps, as shown in Figure~\ref{fig:lmdk-scenario}.
+ These data are the result of user--LBS interaction while retrieving location-based information or reporting user-state at various locations.
+ Shaded events correspond to privacy-sensitive events that Bob has defined beforehand.
+ For instance, his home is around {\'E}lys{\'e}e, his workplace is around the Louvre, and his hangout is around Canal Saint-Martin.
+
+ \begin{figure}[htp]
+  \centering
+  \includegraphics[width=\linewidth]{lmdk-scenario}
+  \caption{A time series with {\thethings} (highlighted in gray).}
+  \label{fig:lmdk-scenario}
+ \end{figure}
+
+\end{example}
+
+The regulation regarding the processing of user-generated data sets~\cite{tankard2016gdpr} requires the provision of privacy guarantees to the users.
+At the same time, it is essential to provide utility metrics to the final consumers of the privacy-preserving process output.
+A widely recognized tool that introduces probabilistic randomness to the original data, while quantifying the privacy/utility ratio with a parameter $\varepsilon$ (the `privacy budget'~\cite{mcsherry2009privacy}), is \emph{$\varepsilon$-differential privacy}~\cite{dwork2006calibrating}.
+\emph{Event}, \emph{user}~\cite{dwork2010differential, dwork2010pan}, and \emph{$w$-event}~\cite{kellaris2014differentially} comprise the possible levels of privacy protection.
+Event-level limits the privacy protection to \emph{any single event}, user-level protects \emph{all the events} of any user, and $w$-event provides privacy protection to \emph{any sequence of $w$ events}.
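+
+To fix ideas, the following sketch (our own illustration; a uniform allocation of the budget is assumed, as in the scenario discussed below) spells out how much of the total budget $\varepsilon$ a single release may consume under each protection level.
+
+\begin{verbatim}
+def per_event_budget(epsilon, n, level, w=None):
+    # Budget available to each of n sequential releases.
+    if level == "event":    # each single event is protected in isolation
+        return epsilon
+    if level == "user":     # all n events jointly stay within epsilon
+        return epsilon / n
+    if level == "w-event":  # any w consecutive events stay within epsilon
+        return epsilon / w
+    raise ValueError(level)
+
+print(per_event_budget(1.0, 8, "user"))          # 0.125
+print(per_event_budget(1.0, 8, "w-event", w=8))  # user level here: 0.125
+\end{verbatim}
+
+Notice that less budget per event means more noise per release; the allocation proposed below exploits exactly this trade-off.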
+
+The privacy mechanisms for the aforementioned levels assume that, in a time series, any single event, or any sequence of events, or the entire series of events is equally privacy-significant for the users.
+In reality, this is a simplistic assumption.
+The significance of an event is related to certain user-defined privacy criteria, or to its adjacent events, as well as to the entire time series.
+Identifying \emph{\thething} (significant) events can be done in an automatic or manual way (though this is out of the scope of this work).
+For example, in spatiotemporal data, \emph{places where an individual spent some time} denote \emph{points of interest} (POIs) (also called stay points)~\cite{zheng2015trajectory}.
+Such data items, and more particularly their spatial attribute values, can be less privacy-sensitive~\cite{primault2018long}, e.g.,~parks, theaters, etc., or, if individuals frequent them, they can reveal supplementary information, e.g.,~residences (home addresses)~\cite{gambs2010show}, places of worship (religious beliefs)~\cite{franceschi-bicchierairussell2015redditor}, etc.
+POIs are one example of how we can choose {\thethings}, but the idea is not limited to them.
+
+\begin{figure}[htp]
+ \centering
+ \includegraphics[width=\linewidth]{st-cont}
+ \caption{User-level and {\thething} $\varepsilon$-differential privacy protection for the time series of Figure~\ref{fig:lmdk-scenario}.}
+ \label{fig:st-cont}
+\end{figure}
+
+We argue that protecting only the {\thething} events, along with any regular event being released, is sufficient for the user's protection, while it improves data utility.
+Take for example the scenario in Figure~\ref{fig:st-cont}, where {\thethings} are highlighted in gray.
+If we want to protect the {\thething} points, we have to allocate at most a budget of $\varepsilon$ to the {\thethings}.
+Notice that the more budget we allocate to an event the less we protect it, but at the same time we maintain its utility.
+In this scenario, event-level protection is not suitable, since it can only protect one event at a time.
+Hence, we have to apply user-level privacy protection by distributing equal portions of $\varepsilon$ to all the events, i.e.,~$\frac{\varepsilon}{8}$ to each one (the equivalent of applying $8$-event privacy).
+In this way, we have protected the {\thething} points, as we have allocated to them a total of $4 \times \frac{\varepsilon}{8} = \frac{\varepsilon}{2} < \varepsilon$.
+However, perturbing each regular point by $\frac{\varepsilon}{8}$ deteriorates the data utility unnecessarily.
+With {\thething} privacy, we propose to distribute the budget taking into account only the existence of the {\thethings} when we release an event of the time series, i.e.,~allocating $\frac{\varepsilon}{5}$ ($4\ \text{\thethings} + 1\ \text{regular point}$) to each event (see Figure~\ref{fig:st-cont}).
+This way, we still guarantee that the {\thethings} are adequately protected, as they receive a total budget of $4 \times \frac{\varepsilon}{5} = \frac{4\varepsilon}{5} < \varepsilon$.
+At the same time, we avoid over-perturbing the regular events, as we allocate to them a higher total budget ($\frac{4\varepsilon}{5}$) than in the user-level case ($\frac{\varepsilon}{2}$), and thus less noise.
diff --git a/text/the-thing.tex b/text/the-thing/problem.tex
similarity index 55%
rename from text/the-thing.tex
rename to text/the-thing/problem.tex
index 4d81823..43054c3 100644
--- a/text/the-thing.tex
+++ b/text/the-thing/problem.tex
@@ -1,84 +1,3 @@
-\chapter{Significant events}
-\label{ch:the-thing}
-
-In this chapter, we propose a novel configurable privacy scheme, \emph{\thething} privacy, which takes into account significant events (\emph{\thethings}) in the time series and allocates the available privacy budget accordingly.
-We propose two privacy models that guarantee {\thething} privacy and validate our proposal on real and synthetic data sets. - - -\section{Motivation} -\label{sec:lmdk-motiv} - -The plethora of sensors currently embedded in -or paired with personal devices and other infrastructures have paved the way for the development of numerous \emph{crowdsensing services} (e.g.,~Google Maps~\cite{gmaps}, Waze~\cite{waze}, etc.) based on the collected personal, and usually geotagged and timestamped data. - -User--service interactions gather personal event-like data, e.g.,~(\emph{`Bob', `dining', `Canal Saint-Martin', $17{:}00$}). -When the interactions are performed in a continuous manner, we obtain ~\emph{time series} of events. -An \emph{event} represents a user--service interaction, registering the information of the individual at a specific time point, i.e.,~a data item that is a pair of an identifying attribute of an individual and the---possibly sensitive---information at a timestamp (including contextual information). -It can be seen as a correspondence to a record in a database, where each individual may participate once, e.g.,~(`Bob', `dining', `Canal Saint-Martin', $5$). -Typically, users interact with the services more than once, generating data in a continuous manner (\emph{time series}). -The services collect and further process the time series in order to give useful feedback to the involved users or to provide valuable insight to various internal/external analytical services. -Depending on its span, we distinguish the processing into \emph{finite}, when taking place during a predefined time interval, and \emph{infinite}, when taking place in an uninterrupted fashion. -% Figure~\ref{fig:scenario} shows an example of a finite time series produced by a user (Bob) and composed by $8$ timestamps during his trajectory from his home (\'Elys\'ee) to his work (Louvre) to his hangout (Saint-Martin) and back to his home. - -\begin{example} - \label{ex:lmdk-scenario} - - Consider a finite sequence of spatiotemporal data generated by Bob during an interval of $\ 8$ timestamps, as shown in Figure~\ref{fig:lmdk-scenario}. - These data are the result of user--LBS interaction while retrieving location-based information or reporting user-state at various locations. - Events in a shade correspond to privacy-sensitive events that Bob has defined beforehand. For instance his home is around {\'E}lys{\'e}e, his workplace is around the Louvre, and his hangout is around Canal Saint-Martin. - - \begin{figure}[htp] - \centering - \includegraphics[width=\linewidth]{lmdk-scenario} - \caption{A time series with {\thethings} (highlighted in gray).} - \label{fig:lmdk-scenario} - \end{figure} - -\end{example} - -The regulation regarding the processing of user-generated data sets~\cite{tankard2016gdpr} requires the provision of privacy guarantees to the users. -At the same time, it is essential to provide utility metrics to the final consumers of the privacy-preserving process output. -A widely recognized tool that introduces probabilistic randomness to the original data, while quantifying with a parameter $\varepsilon$ (`privacy budget'~\cite{mcsherry2009privacy}) the privacy/utility ratio is \emph{$\varepsilon$-differential privacy}~\cite{dwork2006calibrating}. -\emph{Event}, \emph{user}~\cite{dwork2010differential, dwork2010pan}, and \emph{$w$-event}~\cite{kellaris2014differentially} comprise the possible levels of privacy protection. 
-Event-level limits the privacy protection to \emph{any single event}, user-level protects \emph{all the events} of any user, and $w$-event provides privacy protection to \emph{any sequence of $w$ events}. - -The privacy mechanisms for the aforementioned levels assume that in a time series any single event, or any sequence of events, or the entire series of events is equally privacy-significant for the users. -In reality, this is an simplistic assumption. -The significance of an event is related to certain user-defined privacy criteria, or to its adjacent events, as well as to the entire time series. -Identifying \emph{\thething} (significant) events can be done in an automatic or manual way (but is out of scope for this work). -For example, in spatiotemporal data, \emph{places where an individual spent some time} denote \emph{points of interest} (POIs) (called also stay points)~\cite{zheng2015trajectory}. -Such data items, and more particularly their spatial attribute values, can be less privacy-sensitive~\cite{primault2018long}, e.g.,~parks, theaters, etc. or, if individuals frequent them, they can reveal supplementary information, e.g.,~residences (home addresses)~\cite{gambs2010show}, places of worship (religious beliefs)~\cite{franceschi-bicchierairussell2015redditor}, etc. -POIs can be an example of how we can choose {\thethings}, but the idea is not limited to these. - -\begin{figure}[htp] - \centering - \includegraphics[width=\linewidth]{st-cont} - \caption{User-level and {\thething} $\varepsilon$-differential privacy protection for the time series of Figure~\ref{fig:lmdk-scenario}.} - \label{fig:st-cont} -\end{figure} - -We argue that protecting only {\thething} events along with any regular event release is sufficient for the user's protection, while it improves data utility. -Take for example the scenario in Figure~\ref{fig:st-cont}, where {\thethings} are highlighted in gray. -If we want to protect the {\thething} points, we have to allocate at most a budget of $\varepsilon$ to the {\thethings}. -Notice that the more budget we allocate to an event the less we protect it, but at the same time we maintain its utility. -In this scenario, event-level protection is not suitable since it can only protect one event at a time. -Hence, we have to apply user-level privacy protection by distributing equal portions of $\varepsilon$ to all the events, i.e.,~$\frac{\varepsilon}{8}$ to each one (the equivalent of applying $8$-event privacy). -In this way, we have protected the {\thething} points; we have allocated a total of $\frac{\varepsilon}{2}<\varepsilon$ to the {\thethings}. -However, perturbing by $\frac{\varepsilon}{8}$ each regular point deteriorates the data utility unnecessarily. -With {\thething} privacy we propose to distribute the budget taking into account only the existence of the {\thethings} when we release an event of the time series, i.e.,~allocating $\frac{\varepsilon}{5}$ ($4\ \text{\thethings} + 1\ \text{regular point}$) to each event (see Figure~\ref{fig:st-cont}). -This way, we still guarantee that the {\thethings} are adequately protected, as they receive a total budget of $\frac{4\varepsilon}{5}<\varepsilon$. -At the same time, we avoid over-perturbing the regular events, as we allocate to them a higher total budget ($\frac{4\varepsilon}{5}$) than in user-level ($\frac{\varepsilon}{2}$), and thus less noise. - - -\section{Contribution} -\label{sec:lmdk-contrib} - -In this chapter, we formally define a novel privacy notion that we call \emph{{\thething} privacy}. 
-
-
-\section{Contribution}
-\label{sec:lmdk-contrib}
-
-In this chapter, we formally define a novel privacy notion that we call \emph{{\thething} privacy}.
-We apply this privacy notion to time series consisting of \emph{{\thethings}} and regular events, and we design and implement three {\thething} privacy mechanisms.
-We further study {\thething} privacy under the temporal correlation that is inherent in time series publishing.
-Finally, we evaluate {\thething} privacy on real and synthetic data sets, in settings with or without temporal correlation, showcasing the validity of our model.
-
-
 \section{{\Thething} privacy}
 \label{sec:prob}
@@ -417,142 +336,3 @@ Finally, $\alpha_t$ is equal to the sum of all $\alpha_i , i\in L \cup\{t\}$.
-%Notice that if $t$ is the first or last item in $L \cup \{i\}$ then we need to set $t_{\text{prv}} = 0$ or $t_{\text{nxt}} = \max(T) + 1$.
-%In Section~\ref{sec:eval}, we experimentally show how the distribution of {\thethings} impacts the overall privacy loss of the user.
-
-
-
-\section{Evaluation}
-\label{sec:the-thing-eval}
-
-In this section, we present the experiments that we performed on real and synthetic data sets.
-With the experiments on the synthetic data sets, we show the privacy loss of our framework when tuning the size and statistical characteristics of the input {\thething} set $L$.
-We also show how the privacy loss under temporal correlation is affected by the number and distribution of the {\thethings}.
-With the experiments on the real data sets, we show the performance, in terms of utility, of our three {\thething} mechanisms.
-
-Notice that in our experiments, in the cases where $0\%$ and $100\%$ of the events are {\thethings}, we get the same behavior as in event- and user-level privacy, respectively.
-This happens due to the fact that, when there are no {\thethings}, at each timestamp we take into account only the data items of the current timestamp and ignore the rest of the time series (event-level).
-When every timestamp corresponds to a {\thething}, we consider and protect all the events throughout the entire series (user-level).
-
-
-\subsection{Setting, configurations, and data sets}
-\paragraph{Setting}
-We implemented our experiments\footnote{Code available at \url{https://gitlab.com/adhesivegoldfinch/cikm}} in Python $3.9.5$ and executed them on a machine with an Intel i$7$-$6700$HQ $3.5$GHz CPU and $16$GB RAM, running Manjaro $21.0.5$.
-We repeated each experiment $100$ times and report the mean over these iterations.
-
-
-\paragraph{Data sets}
-For the \emph{real} data sets, we used Geolife~\cite{zheng2010geolife} and T-drive~\cite{yuan2010t}, from which we sampled the first $1000$ data items.
-We achieved the desired {\thething} percentages by utilizing the method of Li et al.~\cite{li2008mining} for detecting stay points in trajectory data.
-In more detail, for each data item, the algorithm checks whether the subsequent items lie within a given distance threshold $\Delta l$, and measures the time period $\Delta t$ between the present point and the last such item.
-We achieve {\thething} percentages of $0$\%, $20$\%, $40$\%, $60$\%, $80$\%, and $100$\% by setting the ($\Delta l$ in meters, $\Delta t$ in minutes) input pairs of the stay point discovery method for T-drive to [($0$, $1000$), ($2095$, $30$), ($2790$, $30$), ($3590$, $30$), ($4825$, $30$), ($10350$, $30$)] and for Geolife to [($0$, $100000$), ($205$, $30$), ($450$, $30$), ($725$, $30$), ($855$, $30$), ($50000$, $30$)].
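-
-For reference, the following is a minimal sketch of such a stay point detection routine (our simplified reading of Li et al.~\cite{li2008mining}; we assume planar coordinates and a simple data layout, whereas real trajectories require a geographic distance):
-
-\begin{verbatim}
-from math import dist  # Euclidean; GPS data needs, e.g., haversine
-
-def stay_points(points, d_thresh, t_thresh):
-    # points: list of (x, y, t) tuples, ordered by time t.
-    stays, i = [], 0
-    while i < len(points):
-        j = i + 1
-        # Extend the window while successors stay within d_thresh.
-        while (j < len(points)
-               and dist(points[j][:2], points[i][:2]) <= d_thresh):
-            j += 1
-        # Keep the anchor if the window spans at least t_thresh.
-        if points[j - 1][2] - points[i][2] >= t_thresh:
-            stays.append(i)
-        i = j
-    return stays
-\end{verbatim}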
-
-
-Next, we generated synthetic time series of length equal to $100$ timestamps, for which we varied the number and distribution of {\thethings}.
-We created \emph{left-skewed} (the {\thethings} are distributed towards the end), \emph{symmetric} (in the middle), \emph{right-skewed} (in the beginning), \emph{bimodal} (both in the beginning and in the end), and \emph{uniform} (all over the time series) {\thething} distributions.
-When pertinent, we group the left- and right-skewed cases as simply `skewed', since they share several features due to symmetry.
-In order to get {\thethings} with the above distribution features, we generate probability distributions with the appropriate characteristics and sample from them, without replacement, the desired number of points.
-For example, for a left-skewed {\thething} distribution, we utilize a truncated distribution resulting from the restriction of the domain of a normal distribution to the beginning and end of the time series, with its location shifted to the center of the right half of the series.
-For consistency, we calculate the scale parameter depending on the length of the series, by setting it equal to the series' length over a constant.
-Note that, for the experiments performed on the synthetic data sets, the original values to be released do not influence our conclusions; thus, we ignore them.
-
-
-\paragraph{Configurations}
-We model the temporal correlation in the synthetic data as a \emph{stochastic matrix} $P$, using a \emph{Markov chain}~\cite{gagniuc2017markov}.
-$P$ is an $n \times n$ matrix, where the element $p_{ij}$, at the $i$th row and $j$th column, represents the transition probability from a state $i$ to another state $j$, $\forall i, j \leq n$.
-It holds that the elements of every row $i$ of $P$ sum up to $1$.
-We follow the \emph{Laplacian smoothing} technique~\cite{sorkine2004laplacian}, as utilized in~\cite{cao2018quantifying}, and generate $P$ with a degree of temporal correlation $s > 0$ by calculating each element $p_{ij}$ as
-$$\frac{(I_{n})_{ij} + s}{\sum_{k = 1}^{n}((I_{n})_{ik} + s)}$$
-where $I_{n}$ is an \emph{identity matrix} of size $n$, i.e.,~an $n \times n$ matrix with $1$s on its main diagonal and $0$s elsewhere.
-$s$ takes only positive values, which are comparable only for stochastic matrices of the same size.
-$s$ dictates the strength of the correlation; the lower its value, the lower the degree of uniformity of each row, and therefore the stronger the correlation degree.
-In general, larger transition matrices tend to be more uniform, resulting in weaker correlation.
-In our experiments, for simplicity, we set $n = 2$ and investigate the effect of \emph{weak} ($s = 1$), \emph{moderate} ($s = 0.1$), and \emph{strong} ($s = 0.01$) temporal correlation degrees on the overall privacy loss.
-
-We set $\varepsilon = 1$.
-To perturb the spatial values of the real data sets, we inject noise that we sample from the Planar Laplace mechanism~\cite{andres2013geo}.
-Finally, notice that all diagrams are in logarithmic scale.
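-
-Regarding the synthetic {\thething} distributions described above, one possible realization for, e.g.,~the left-skewed case is the following sketch (assuming NumPy; the exact parameterization in our implementation may differ):
-
-\begin{verbatim}
-import numpy as np
-
-def sample_landmarks(length, n_landmarks, loc, scale, rng=None):
-    # Weight each timestamp by a Gaussian density centered at loc,
-    # then sample distinct positions (without replacement).
-    if rng is None:
-        rng = np.random.default_rng()
-    ts = np.arange(length)
-    w = np.exp(-0.5 * ((ts - loc) / scale) ** 2)
-    return np.sort(rng.choice(ts, size=n_landmarks,
-                              replace=False, p=w / w.sum()))
-
-# Left-skewed case: mass towards the end of the series;
-# the scale is tied to the series length (length / constant).
-landmarks = sample_landmarks(100, 20, loc=75, scale=100 / 10)
-\end{verbatim}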
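-
-Likewise, the construction of the correlation matrix $P$ can be sketched as follows (assuming NumPy; this mirrors the formula above):
-
-\begin{verbatim}
-import numpy as np
-
-def correlation_matrix(n, s):
-    # Laplacian smoothing of the identity matrix:
-    # p_ij = (I_ij + s) / sum_k (I_ik + s).
-    # A smaller s keeps rows less uniform: stronger correlation.
-    p = np.eye(n) + s
-    return p / p.sum(axis=1, keepdims=True)
-
-weak, moderate, strong = (correlation_matrix(2, s)
-                          for s in (1, 0.1, 0.01))
-\end{verbatim}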
-
-\subsection{Experiments}
-
-\paragraph{Budget allocation schemes}
-
-Figure~\ref{fig:real} exhibits the performance of the three mechanisms: Skip, Uniform, and Adaptive.
-
-\begin{figure}[htp]
- \centering
- \subcaptionbox{Geolife\label{fig:geolife}}{%
-  \includegraphics[width=.5\linewidth]{geolife}%
- }%
- \subcaptionbox{T-drive\label{fig:t-drive}}{%
-  \includegraphics[width=.5\linewidth]{t-drive}%
- }%
- \caption{The mean absolute error (in meters) of the released data for different {\thething} percentages.}
- \label{fig:real}
-\end{figure}
-
-For the Geolife data set (Figure~\ref{fig:geolife}), Skip has the best performance (measured in mean absolute error, in meters), because it invests the most budget overall at every regular event, approximating the {\thething} data based on previous releases.
-Due to the data set's high density (one point every $1$--$5$ seconds or every $5$--$10$ meters), constantly approximating has a low impact on the data utility.
-On the contrary, the lower density of the T-drive data set (Figure~\ref{fig:t-drive}) has a negative impact on the performance of Skip.
-In the T-drive data set, the Adaptive mechanism outperforms Uniform by $10$\%--$20$\%, for all {\thething} percentages greater than $0$, and Skip by more than $20$\%.
-In general, we can claim that Adaptive is the best-performing mechanism, taking into consideration the drawbacks of the Skip mechanism mentioned in Section~\ref{subsec:lmdk-mechs}.
-Moreover, designing a data-dependent sampling scheme would possibly further improve the results of Adaptive.
-
-
-\paragraph{Temporal distance and correlation}
-Figure~\ref{fig:avg-dist} shows a comparison of the average temporal distance of the events from the previous/next {\thething}, or from the start/end of the time series, for various distributions in the synthetic data.
-More particularly, for every event we count the total number of events between itself and the nearest {\thething} or series edge.
-We observe that the uniform and bimodal distributions tend to limit the regular event--{\thething} distance.
-This is due to the fact that the former scatters the {\thethings}, while the latter distributes them on both edges, leaving a shorter space uninterrupted by {\thethings}.
-On the contrary, distributing the {\thethings} at one part of the sequence, as in the skewed or symmetric cases, creates a wider space without {\thethings}.
-
-\begin{figure}[htp]
- \centering
- \includegraphics[width=.5\linewidth]{avg-dist}%
- \caption{Average temporal distance of the events from the {\thethings}, for different {\thething} percentages and distributions within a time series.}
- \label{fig:avg-dist}
-\end{figure}
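-
-The metric of Figure~\ref{fig:avg-dist} can be computed as in the following sketch (our helper, under the assumption that we average over the regular events only; `landmarks' holds the timestamp indices of the {\thethings}):
-
-\begin{verbatim}
-def avg_distance(length, landmarks):
-    # Anchors: the landmarks plus virtual events just before the
-    # start and just after the end of the series.
-    anchors = set(landmarks) | {-1, length}
-    dists = [min(abs(t - a) for a in anchors) - 1
-             for t in range(length) if t not in landmarks]
-    return sum(dists) / len(dists) if dists else 0.0
-\end{verbatim}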
-
-Figure~\ref{fig:dist-cor} illustrates a comparison among the aforementioned distributions regarding the overall privacy loss under weak (Figure~\ref{fig:dist-cor-wk}), moderate (Figure~\ref{fig:dist-cor-mod}), and strong (Figure~\ref{fig:dist-cor-stg}) correlation degrees.
-The line shows the overall privacy loss---for all cases of {\thething} distribution---without temporal correlation.
-Under a weak correlation degree, the results of the different distributions converge.
-In combination with Figure~\ref{fig:avg-dist}, we conclude that a greater average event--{\thething} distance in a distribution can result in greater overall privacy loss under moderate and strong temporal correlation.
-This is due to the fact that the backward/forward privacy loss accumulates more over time in wider spaces without {\thethings} (see Section~\ref{subsec:correlations}).
-Furthermore, the privacy loss behaves as expected with respect to the temporal correlation degree: a stronger correlation degree generates higher privacy loss, while widening the gap between the different distribution cases.
-On the contrary, a weaker correlation degree makes it harder to differentiate among the {\thething} distributions.
-
-\begin{figure}[htp]
- \centering
- \subcaptionbox{Weak correlation\label{fig:dist-cor-wk}}{%
-  \includegraphics[width=.5\linewidth]{dist-cor-wk}%
- }%
- \hspace{\fill}
- \subcaptionbox{Moderate correlation\label{fig:dist-cor-mod}}{%
-  \includegraphics[width=.5\linewidth]{dist-cor-mod}%
- }%
- \subcaptionbox{Strong correlation\label{fig:dist-cor-stg}}{%
-  \includegraphics[width=.5\linewidth]{dist-cor-stg}%
- }%
- \caption{Privacy loss for different {\thething} percentages and distributions, under weak, moderate, and strong degrees of temporal correlation.
- The line shows the overall privacy loss without temporal correlation.}
- \label{fig:dist-cor}
-\end{figure}
-
-
-\section{Summary and future work}
-\label{sec:lmdk-sum}
-In this chapter, we presented \emph{{\thething} privacy} for privacy-preserving time series publishing, which allows for the protection of significant events, while improving the utility of the final result w.r.t.~traditional user-level differential privacy.
-We also proposed three mechanisms for {\thething} privacy, and quantified the privacy loss under temporal correlation.
-Our experiments on real and synthetic data sets validate our proposal.
-In the future, we aim to investigate privacy-preserving {\thething} selection and to propose a mechanism based on user preferences and semantics.
diff --git a/text/the-thing/summary.tex b/text/the-thing/summary.tex
new file mode 100644
index 0000000..b604246
--- /dev/null
+++ b/text/the-thing/summary.tex
@@ -0,0 +1,6 @@
+\section{Summary and future work}
+\label{sec:lmdk-sum}
+In this chapter, we presented \emph{{\thething} privacy} for privacy-preserving time series publishing, which allows for the protection of significant events, while improving the utility of the final result w.r.t.~traditional user-level differential privacy.
+We also proposed three mechanisms for {\thething} privacy, and quantified the privacy loss under temporal correlation.
+Our experiments on real and synthetic data sets validate our proposal.
+In the future, we aim to investigate privacy-preserving {\thething} selection and to propose a mechanism based on user preferences and semantics.