problem: Structure

2021-10-08 21:32:06 +02:00
parent 69218c312a
commit 386065c411
8 changed files with 100 additions and 120 deletions
--- a/text/problem/main.tex
+++ b/text/problem/main.tex
@@ -1,9 +1,11 @@
 <<<<<<< HEAD
 \chapter{Landmark privacy}
 \label{ch:thething-prv}
 =======
 \chapter{Landmark Privacy}
->>>>>>> b334e056b320357ce4f4eaa89a1be7f3576350cf
+\label{ch:lmdk-prv}
 In this chapter, we propose a novel configurable privacy scheme, \emph{\thething} privacy, which takes into account significant events (\emph{\thethings}) in the time series and allocates the available privacy budget accordingly.
 We propose two privacy models that guarantee {\thething} privacy.
 To further enhance our privacy method and protect the position of the {\thethings} in the time series, we propose techniques to perturb the initial set of {\thethings} (Section~\ref{sec:theotherthing}).
 \input{problem/thething/main}
 \input{problem/theotherthing/main}
 \input{problem/summary}
--- a/text/problem/thething/summary.tex
+++ b/text/problem/thething/summary.tex
@@ -1,5 +1,6 @@
 \section{Summary}
 \label{sec:lmdk-sum}
 In this chapter, we presented \emph{{\thething} privacy} for privacy-preserving time series publishing, which allows for the protection of significant events, while improving the utility of the final result w.r.t. the traditional user-level differential privacy.
 We also proposed three models for  {\thething} privacy, and quantified the privacy loss under temporal correlation.
 %Our experiments on real and synthetic data sets validate our proposal. 
--- a/text/problem/theotherthing/main.tex
+++ b/text/problem/theotherthing/main.tex
@@ -1,2 +1,2 @@
-\subsection{Selection of events}
+\section{Selection of events}
-\label{subsec:theotherthing}
+\label{sec:theotherthing}
--- a/text/problem/thething/contribution.tex
+++ b/text/problem/thething/contribution.tex
@@ -1,7 +1,6 @@
-\section{Contribution}
+\subsection{Contribution}
-\label{sec:lmdk-contrib}
+\label{subsec:lmdk-contrib}
-In this chapter, we formally define a novel privacy notion that we call \emph{{\thething} privacy}.
+In this section, we formally define a novel privacy notion that we call \emph{{\thething} privacy}.
 We apply this privacy notion to time series consisting of \emph{{\thethings}} and regular events, and we design and implement three {\thething} privacy mechanisms.
 We further study {\thething} privacy under temporal correlation that is inherent in time series publishing.
 Finally, we evaluate {\thething} privacy with real and synthetic data sets, in settings with or without temporal correlation, showcasing the validity of our model.
--- a/text/problem/thething/main.tex
+++ b/text/problem/thething/main.tex
@@ -1,24 +1,85 @@
-%\section{Significant events}
+\section{Significant events}
-%\label{sec:thething}
+\label{sec:thething}
 % Crowdsensing applications
 The plethora of sensors currently embedded in personal devices and other infrastructures has paved the way for the development of numerous \emph{crowdsensing services} (e.g.,~Ring~\cite{ring}, TousAntiCovid~\cite{tousanticovid}, Waze~\cite{waze}, etc.) based on the collected personal, and usually geotagged and timestamped, data.
 % Continuously user-generated data
 User--service interactions gather personal event-like data, i.e.,~data items comprising pairs of an identifying attribute of an individual and the---possibly sensitive---information at a timestamp (including contextual information), e.g.,~(\emph{`Bob', `dining', `Canal Saint-Martin', $17{:}00$}).
 When the interactions are performed in a continuous manner, we obtain \emph{time series} of events.
 % Observation/interaction duration
 Depending on the duration, we distinguish the interaction/observation into \emph{finite}, when taking place during a predefined time interval, and \emph{infinite}, when taking place in an uninterrupted fashion.
 Example~\ref{ex:scenario} shows the result of user--LBS interaction while retrieving location-based information or reporting user-state at various locations.
 \begin{example}
  \label{ex:scenario}
  Consider a finite sequence of spatiotemporal data generated by Bob during an interval of $8$ timestamps, as shown in Figure~\ref{fig:scenario}.
  Shaded events correspond to privacy-sensitive events that Bob has defined beforehand. For instance, his home is around {\'E}lys{\'e}e, his workplace is around the Louvre, and his hangout is around Canal Saint-Martin.
  \begin{figure}[htp]
    \centering
    \includegraphics[width=\linewidth]{lmdk-scenario}
    \caption{A time series with {\thethings} (highlighted in gray).
    }
    \label{fig:scenario}
  \end{figure}
 \end{example}
 % Privacy-preserving data processing
 The services collect and further process the time series in order to give useful feedback to the involved users or to provide valuable insight to various internal/external analytical services.
 The regulation regarding the processing of user-generated data sets~\cite{tankard2016gdpr} requires the provision of privacy guarantees to the users. 
 At the same time, it is essential to provide utility metrics to the final consumers of the privacy-preserving process output. 
 To accomplish this, various privacy techniques perturb the original data or the processing output at the expense of the overall utility of the final output.
 A widely recognized tool that introduces probabilistic randomness to the original data, while quantifying the privacy/utility ratio with a parameter $\varepsilon$ (`privacy budget'~\cite{mcsherry2009privacy}), is \emph{$\varepsilon$-differential privacy}~\cite{dwork2006calibrating}.
 Due to its \emph{composition} property, i.e.,~the combination of differentially private outputs satisfies differential privacy as well, differential privacy is suitable for privacy-preserving time series publishing.
 \emph{Event}, \emph{user}~\cite{dwork2010differential, dwork2010pan}, and \emph{$w$-event}~\cite{kellaris2014differentially} comprise the possible levels of privacy protection.
 Event-level limits the privacy protection to \emph{any single event}, user-level protects \emph{all the events} of any user, and $w$-event provides privacy protection to \emph{any sequence of $w$ events}.
 The privacy mechanisms for the aforementioned levels assume that in a time series any single event, or any sequence of events, or the entire series of events is equally privacy-significant for the users.
 In reality, this is a simplistic assumption.
 The significance of an event is related to certain user-defined privacy criteria, or to its adjacent events, as well as to the entire time series.
 We term significant events \emph{{\thething} events} or simply \emph{\thethings}.
 Identifying {\thethings} can be done in an automatic or manual way (but is out of scope for this work).
 For example, in spatiotemporal data, \emph{places where an individual spent some time} denote \emph{points of interest} (POIs) (also called stay points)~\cite{zheng2015trajectory}.
 Such events, and more particularly their spatial attribute values, can be less privacy-sensitive~\cite{primault2018long}, e.g.,~parks, theaters, etc. or, if individuals frequent them, they can reveal supplementary information, e.g.,~residences (home addresses)~\cite{gambs2010show}, places of worship (religious beliefs)~\cite{franceschi-bicchierairussell2015redditor}, etc.
 POIs can be an example of how we can choose {\thethings}, but the idea is not limited to these.
 Another example is the detection of privacy-sensitive user interactions by \emph{contact tracing} applications.
 This can be practical in disease control~\cite{eames2003contact}, as in the recent outbreak of the Coronavirus disease 2019 (COVID-19) epidemic~\cite{ahmed2020survey}.
 Last but not least, {\thethings} in \emph{smart grid} electricity usage patterns could not only reveal the energy consumption of a user but also information regarding activities, e.g.,~`at work', `sleeping', etc. and types of appliances already installed or recently purchased~\cite{khurana2010smart}.
 \begin{example}
  \label{ex:st-cont}
  Figure~\ref{fig:st-cont} shows the case when we want to protect all of Bob's significant events ($p_1$, $p_3$, $p_5$, $p_8$) in his trajectory shown in Figure~\ref{fig:scenario}.
  % That is, we have to allocate privacy budget $\varepsilon$ such that at any timestamp $t$ it holds that $\varepsilon_t + \varepsilon_1 + \varepsilon_3 + \varepsilon_5 + \varepsilon_8 \leq \varepsilon$.
  In this scenario, event-level protection is not suitable since it can only protect one event at a time.
  Hence, we have to apply user-level privacy protection by distributing equal portions of $\varepsilon$ to all the events, i.e.,~$\frac{\varepsilon}{8}$ to each one (the equivalent of applying $8$-event privacy).
  In this way, we have protected the {\thething} points; we have allocated a total of $\frac{\varepsilon}{2}<\varepsilon$ to the {\thethings}. 
  \begin{figure}[htp]
    \centering
    \includegraphics[width=\linewidth]{st-cont}
    \caption{User-level and {\thething} $\varepsilon$-differential privacy protection for the time series of Figure~\ref{fig:scenario}.}
    \label{fig:st-cont}
  \end{figure}
  However, perturbing each regular point with a budget of $\frac{\varepsilon}{8}$ deteriorates the data utility unnecessarily.
  Notice that the overall privacy budget that we ended up allocating to the user-defined significant events is equal to $\frac{\varepsilon}{2}$ and leaves an equal amount of budget to distribute to any current event.
  In other words, uniformly allocating $\frac{\varepsilon}{5}$ to every event would still achieve Bob's privacy goal, i.e.,~protect every significant event, while achieving better utility overall.
 \end{example}
 We argue that protecting only {\thething} events along with any regular event release is sufficient for the user's protection, while it improves data utility.
 Considering {\thething} events can prevent over-perturbing the data, to the benefit of their final quality.
 Take for example the scenario in Figure~\ref{fig:st-cont}, where {\thethings} are highlighted in gray.
 If we want to protect the {\thething} points, we have to allocate at most a budget of $\varepsilon$ to the {\thethings}, while saving some for the release of regular events.
 Essentially, the more budget we allocate to an event, the less we protect it, but the better we preserve its utility.
 With {\thething} privacy we propose to distribute the budget taking into account only the existence of the {\thethings} when we release an event of the time series, i.e.,~allocating $\frac{\varepsilon}{5}$ ($4\ \text{\thethings} + 1\ \text{regular point}$) to each event (see  Figure~\ref{fig:st-cont}).
 This way, we still guarantee that the {\thethings} are  adequately protected, as they receive a total budget of $\frac{4\varepsilon}{5}<\varepsilon$. 
 At the same time, we avoid over-perturbing the regular events, as we allocate to them  a higher total budget ($\frac{4\varepsilon}{5}$) than in user-level ($\frac{\varepsilon}{2}$), and thus less noise. 
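 As a worked sketch of this arithmetic, assume uniform allocation over the $8$ timestamps of Figure~\ref{fig:scenario}, and let $\varepsilon_t$ denote the budget spent at timestamp $t$ (notation introduced here only for illustration).
 User-level protection assigns $\varepsilon_t = \frac{\varepsilon}{8}$ to every timestamp, so the four {\thethings} and the four regular events each receive a total of
 \[
   4 \cdot \frac{\varepsilon}{8} = \frac{\varepsilon}{2}.
 \]
 {\Thething} privacy instead assigns $\varepsilon_t = \frac{\varepsilon}{4 + 1} = \frac{\varepsilon}{5}$, so that releasing any regular event at timestamp $t$ together with all the {\thethings} stays within the budget:
 \[
   \varepsilon_t + \varepsilon_1 + \varepsilon_3 + \varepsilon_5 + \varepsilon_8 = \frac{\varepsilon}{5} + \frac{4\varepsilon}{5} = \varepsilon.
 \]
 Under these assumptions, each regular event is released with $\frac{\varepsilon}{5} > \frac{\varepsilon}{8}$, i.e.,~with less noise than under user-level protection.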
 <<<<<<< HEAD
 In this chapter, we propose a novel configurable privacy scheme, \emph{{\thething} privacy}, which takes into account significant events (\emph{\thethings}) in the time series and allocates the available privacy budget accordingly.
 We propose three privacy models that guarantee {\thething} privacy and validate our proposal on real and synthetic data sets.
 \kat{Now, you have space so you need to be more detailed in the discussions, the motivation, the examples etc.}
 \input{problem/thething/motivation}
 \input{problem/thething/contribution}
 \input{problem/thething/problem}
 \input{problem/thething/solution}
 =======
 In this chapter, we propose a novel configurable privacy scheme, \emph{\thething} privacy, which takes into account significant events (\emph{\thethings}) in the time series and allocates the available privacy budget accordingly.
 We propose two privacy models that guarantee {\thething} privacy.
 To further enhance our privacy method and protect the position of the landmarks in the time series, we propose techniques to perturb the initial set of landmarks (Section~\ref{sec:theotherthing}).
 % and validate our proposal on real and synthetic data sets. \kat{this will go in the experiments section}
 \input{problem/thething/motivation}
 \input{problem/thething/contribution}
 \input{problem/thething/problem}
 \input{problem/theotherthing/main}
 >>>>>>> b334e056b320357ce4f4eaa89a1be7f3576350cf
 \input{problem/thething/summary}
--- a/text/problem/thething/motivation.tex
+++ b/text/problem/thething/motivation.tex
@@ -1,80 +0,0 @@
 \section{Motivation}
 \label{sec:lmdk-motiv}
 % Crowdsensing applications
 The plethora of sensors currently embedded in personal devices and other infrastructures has paved the way for the development of numerous \emph{crowdsensing services} (e.g.,~Ring~\cite{ring}, TousAntiCovid~\cite{tousanticovid}, Waze~\cite{waze}, etc.) based on the collected personal, and usually geotagged and timestamped, data.
 % Continuously user-generated data
 User--service interactions gather personal event-like data, i.e.,~data items comprising pairs of an identifying attribute of an individual and the---possibly sensitive---information at a timestamp (including contextual information), e.g.,~(\emph{`Bob', `dining', `Canal Saint-Martin', $17{:}00$}).
 When the interactions are performed in a continuous manner, we obtain \emph{time series} of events.
 % Observation/interaction duration
 Depending on the duration, we distinguish the interaction/observation into \emph{finite}, when taking place during a predefined time interval, and \emph{infinite}, when taking place in an uninterrupted fashion.
 Example~\ref{ex:scenario} shows the result of user--LBS interaction while retrieving location-based information or reporting user-state at various locations.
 \begin{example}
  \label{ex:scenario}
  Consider a finite sequence of spatiotemporal data generated by Bob during an interval of $8$ timestamps, as shown in Figure~\ref{fig:scenario}.
  Shaded events correspond to privacy-sensitive events that Bob has defined beforehand. For instance, his home is around {\'E}lys{\'e}e, his workplace is around the Louvre, and his hangout is around Canal Saint-Martin.
  \begin{figure}[htp]
    \centering
    \includegraphics[width=\linewidth]{lmdk-scenario}
    \caption{A time series with {\thethings} (highlighted in gray).
    }
    \label{fig:scenario}
  \end{figure}
 \end{example}
 % Privacy-preserving data processing
 The services collect and further process the time series in order to give useful feedback to the involved users or to provide valuable insight to various internal/external analytical services.
 The regulation regarding the processing of user-generated data sets~\cite{tankard2016gdpr} requires the provision of privacy guarantees to the users. 
 At the same time, it is essential to provide utility metrics to the final consumers of the privacy-preserving process output. 
 To accomplish this, various privacy techniques perturb the original data or the processing output at the expense of the overall utility of the final output.
 A widely recognized tool that introduces probabilistic randomness to the original data, while quantifying the privacy/utility ratio with a parameter $\varepsilon$ (`privacy budget'~\cite{mcsherry2009privacy}), is \emph{$\varepsilon$-differential privacy}~\cite{dwork2006calibrating}.
 Due to its \emph{composition} property, i.e.,~the combination of differentially private outputs satisfies differential privacy as well, differential privacy is suitable for privacy-preserving time series publishing.
 \emph{Event}, \emph{user}~\cite{dwork2010differential, dwork2010pan}, and \emph{$w$-event}~\cite{kellaris2014differentially} comprise the possible levels of privacy protection.
 Event-level limits the privacy protection to \emph{any single event}, user-level protects \emph{all the events} of any user, and $w$-event provides privacy protection to \emph{any sequence of $w$ events}.
 The privacy mechanisms for the aforementioned levels assume that in a time series any single event, or any sequence of events, or the entire series of events is equally privacy-significant for the users.
 In reality, this is a simplistic assumption.
 The significance of an event is related to certain user-defined privacy criteria, or to its adjacent events, as well as to the entire time series.
 We term significant events \emph{{\thething} events} or simply \emph{\thethings}.
 Identifying {\thethings} can be done in an automatic or manual way (but is out of scope for this work).
 For example, in spatiotemporal data, \emph{places where an individual spent some time} denote \emph{points of interest} (POIs) (also called stay points)~\cite{zheng2015trajectory}.
 Such events, and more particularly their spatial attribute values, can be less privacy-sensitive~\cite{primault2018long}, e.g.,~parks, theaters, etc. or, if individuals frequent them, they can reveal supplementary information, e.g.,~residences (home addresses)~\cite{gambs2010show}, places of worship (religious beliefs)~\cite{franceschi-bicchierairussell2015redditor}, etc.
 POIs can be an example of how we can choose {\thethings}, but the idea is not limited to these.
 Another example is the detection of privacy-sensitive user interactions by \emph{contact tracing} applications.
 This can be practical in disease control~\cite{eames2003contact}, as in the recent outbreak of the Coronavirus disease 2019 (COVID-19) epidemic~\cite{ahmed2020survey}.
 Last but not least, {\thethings} in \emph{smart grid} electricity usage patterns could not only reveal the energy consumption of a user but also information regarding activities, e.g.,~`at work', `sleeping', etc. and types of appliances already installed or recently purchased~\cite{khurana2010smart}.
 \begin{example}
  \label{ex:st-cont}
  Figure~\ref{fig:st-cont} shows the case when we want to protect all of Bob's significant events ($p_1$, $p_3$, $p_5$, $p_8$) in his trajectory shown in Figure~\ref{fig:scenario}.
  % That is, we have to allocate privacy budget $\varepsilon$ such that at any timestamp $t$ it holds that $\varepsilon_t + \varepsilon_1 + \varepsilon_3 + \varepsilon_5 + \varepsilon_8 \leq \varepsilon$.
  In this scenario, event-level protection is not suitable since it can only protect one event at a time.
  Hence, we have to apply user-level privacy protection by distributing equal portions of $\varepsilon$ to all the events, i.e.,~$\frac{\varepsilon}{8}$ to each one (the equivalent of applying $8$-event privacy).
  In this way, we have protected the {\thething} points; we have allocated a total of $\frac{\varepsilon}{2}<\varepsilon$ to the {\thethings}. 
  \begin{figure}[htp]
    \centering
    \includegraphics[width=\linewidth]{st-cont}
    \caption{User-level and {\thething} $\varepsilon$-differential privacy protection for the time series of Figure~\ref{fig:scenario}.}
    \label{fig:st-cont}
  \end{figure}
  However, perturbing each regular point with a budget of $\frac{\varepsilon}{8}$ deteriorates the data utility unnecessarily.
  Notice that the overall privacy budget that we ended up allocating to the user-defined significant events is equal to $\frac{\varepsilon}{2}$ and leaves an equal amount of budget to distribute to any current event.
  In other words, uniformly allocating $\frac{\varepsilon}{5}$ to every event would still achieve Bob's privacy goal, i.e.,~protect every significant event, while achieving better utility overall.
 \end{example}
 We argue that protecting only {\thething} events along with any regular event release is sufficient for the user's protection, while it improves data utility.
 Considering {\thething} events can prevent over-perturbing the data, to the benefit of their final quality.
 Take for example the scenario in Figure~\ref{fig:st-cont}, where {\thethings} are highlighted in gray.
 If we want to protect the {\thething} points, we have to allocate at most a budget of $\varepsilon$ to the {\thethings}, while saving some for the release of regular events.
 Essentially, the more budget we allocate to an event, the less we protect it, but the better we preserve its utility.
 With {\thething} privacy we propose to distribute the budget taking into account only the existence of the {\thethings} when we release an event of the time series, i.e.,~allocating $\frac{\varepsilon}{5}$ ($4\ \text{\thethings} + 1\ \text{regular point}$) to each event (see  Figure~\ref{fig:st-cont}).
 This way, we still guarantee that the {\thethings} are  adequately protected, as they receive a total budget of $\frac{4\varepsilon}{5}<\varepsilon$. 
 At the same time, we avoid over-perturbing the regular events, as we allocate to them  a higher total budget ($\frac{4\varepsilon}{5}$) than in user-level ($\frac{\varepsilon}{2}$), and thus less noise. 
--- a/text/problem/thething/problem.tex
+++ b/text/problem/thething/problem.tex
@@ -1,11 +1,8 @@
-<<<<<<< HEAD
+\subsection{Problem definition}
 \subsection{Problem description and definition}
 \label{subsec:lmdk-prob}
 =======
 \section{{\Thething} privacy}
 \label{sec:lmdk-prob}
 >>>>>>> b334e056b320357ce4f4eaa89a1be7f3576350cf
 \subsubsection{Setting}
 \label{subsec:lmdk-set}
 Our problem setting consists of three entities: (i) data generators (users), (ii) data publishers (trusted non-adversarial entities), and (iii) data consumers (possibly adversarial entities). 
 Users generate sensitive data, which are processed in a secure and private way by a trusted curator and are later published in order to be consumed by potentially adversarial data analysts. 
 %The data unit produced by the users is an \emph{event}, i.e., a piece of timestamped user-related information.\kat{should we say geo-stamped?}. 
@@ -33,7 +30,7 @@ Notice that, in a real life scenario, $E_g$ and $E_c$ might overlap with each ot
 \subsubsection{Privacy goal}
-\label{subsec:prv-g}
+\label{subsec:lmdk-goal}
 We argue that in continuous user-generated data publishing, events are not equally `significant' in terms of privacy.
 % We term a significant event---according to user- or data-related criteria---as a \emph{\thething}~event.
--- a/text/problem/thething/solution.tex
+++ b/text/problem/thething/solution.tex
@@ -1,7 +1,6 @@
 \subsection{Achieving {\thething} privacy}
 \label{subsec:lmdk-sol}
 \subsubsection{{\Thething} privacy mechanisms}
 \label{subsec:lmdk-mechs}
 % \kat{add the two models -- uniform and dynamic  and skip}
@@ -132,6 +131,7 @@ to the next timestamps.
 \subsubsection{{\Thething} privacy under temporal correlation}
 \label{subsec:lmdk-cor}
 From the discussion so far, it is evident that for the budget distribution it is not the positions but rather the number of the {\thethings} that matters.
 However, this is not the case under the presence of temporal correlation, which is inherent in continuously generated data.