From 4673e560eeeb8860a6f84b111a7927d523eb8ebe Mon Sep 17 00:00:00 2001 From: Manos Date: Fri, 8 Oct 2021 20:12:55 +0200 Subject: [PATCH] thething: WIP --- text/problem/main.tex | 3 +- text/problem/thething/main.tex | 5 +- text/problem/thething/motivation.tex | 79 ++++--- text/problem/thething/problem.tex | 328 ++++----------------------- 4 files changed, 91 insertions(+), 324 deletions(-) diff --git a/text/problem/main.tex b/text/problem/main.tex index c97cbd4..59a5747 100644 --- a/text/problem/main.tex +++ b/text/problem/main.tex @@ -1,4 +1,5 @@ -\chapter{The problem} +\chapter{Landmark privacy} +\label{ch:thething-prv} \input{problem/thething/main} \input{problem/theotherthing/main} diff --git a/text/problem/thething/main.tex b/text/problem/thething/main.tex index b19fcf5..11a5169 100644 --- a/text/problem/thething/main.tex +++ b/text/problem/thething/main.tex @@ -1,10 +1,11 @@ \section{Significant events} \label{sec:thething} -In this chapter, we propose a novel configurable privacy scheme, \emph{\thething} privacy, which takes into account significant events (\emph{\thethings}) in the time series and allocates the available privacy budget accordingly. -We propose two privacy models that guarantee {\thething} privacy and validate our proposal on real and synthetic data sets. +In this chapter, we propose a novel configurable privacy scheme, \emph{{\thething} privacy}, which takes into account significant events (\emph{\thethings}) in the time series and allocates the available privacy budget accordingly. +We propose three privacy models that guarantee {\thething} privacy and validate our proposal on real and synthetic data sets. 
\kat{Now, you have space so you need to be more detailed in the discussions, the motivation, the examples etc.} \input{problem/thething/motivation} \input{problem/thething/contribution} \input{problem/thething/problem} +\input{problem/thething/solution} \input{problem/thething/summary} diff --git a/text/problem/thething/motivation.tex b/text/problem/thething/motivation.tex index 3899f63..455fcd2 100644 --- a/text/problem/thething/motivation.tex +++ b/text/problem/thething/motivation.tex @@ -1,63 +1,80 @@ \subsection{Motivation} \label{subsec:lmdk-motiv} -The plethora of sensors currently embedded in -or paired with personal devices and other infrastructures have paved the way for the development of numerous \emph{crowdsensing services} (e.g.,~Google Maps~\cite{gmaps}, Waze~\cite{waze}, etc.) based on the collected personal, and usually geotagged and timestamped data. - -User--service interactions gather personal event-like data, e.g.,~(\emph{`Bob', `dining', `Canal Saint-Martin', $17{:}00$}). +% Crowdsensing applications +The plethora of sensors currently embedded in personal devices and other infrastructures has paved the way for the development of numerous \emph{crowdsensing services} (e.g.,~Ring~\cite{ring}, TousAntiCovid~\cite{tousanticovid}, Waze~\cite{waze}, etc.) based on the collected personal, and usually geotagged and timestamped data. +% Continuously user-generated data +User--service interactions gather personal event-like data, i.e.,~data items consisting of pairs of an identifying attribute of an individual and the---possibly sensitive---information at a timestamp (including contextual information), e.g.,~(\emph{`Bob', `dining', `Canal Saint-Martin', $17{:}00$}). When the interactions are performed in a continuous manner, we obtain \emph{time series} of events. 
-An \emph{event} represents a user--service interaction, registering the information of the individual at a specific time point, i.e.,~a data item that is a pair of an identifying attribute of an individual and the---possibly sensitive---information at a timestamp (including contextual information). -It can be seen as a correspondence to a record in a database, where each individual may participate once, e.g.,~(`Bob', `dining', `Canal Saint-Martin', $5$). -Typically, users interact with the services more than once, generating data in a continuous manner (\emph{time series}). -The services collect and further process the time series in order to give useful feedback to the involved users or to provide valuable insight to various internal/external analytical services. -Depending on its span, we distinguish the processing into \emph{finite}, when taking place during a predefined time interval, and \emph{infinite}, when taking place in an uninterrupted fashion. -% Figure~\ref{fig:scenario} shows an example of a finite time series produced by a user (Bob) and composed by $8$ timestamps during his trajectory from his home (\'Elys\'ee) to his work (Louvre) to his hangout (Saint-Martin) and back to his home. - +% Observation/interaction duration +Depending on the duration, we distinguish the interaction/observation into \emph{finite}, when taking place during a predefined time interval, and \emph{infinite}, when taking place in an uninterrupted fashion. +Example~\ref{ex:scenario} shows the result of user--LBS interaction while retrieving location-based information or reporting user-state at various locations. + \begin{example} - \label{ex:lmdk-scenario} + \label{ex:scenario} - Consider a finite sequence of spatiotemporal data generated by Bob during an interval of $\ 8$ timestamps, as shown in Figure~\ref{fig:lmdk-scenario}. - These data are the result of user--LBS interaction while retrieving location-based information or reporting user-state at various locations. 
+ Consider a finite sequence of spatiotemporal data generated by Bob during an interval of $8$ timestamps, as shown in Figure~\ref{fig:scenario}. Events in a shade correspond to privacy-sensitive events that Bob has defined beforehand. For instance his home is around {\'E}lys{\'e}e, his workplace is around the Louvre, and his hangout is around Canal Saint-Martin. \begin{figure}[htp] \centering \includegraphics[width=\linewidth]{lmdk-scenario} - \caption{A time series with {\thethings} (highlighted in gray).} - \label{fig:lmdk-scenario} + \caption{A time series with {\thethings} (highlighted in gray). + } + \label{fig:scenario} \end{figure} \end{example} +% Privacy-preserving data processing +The services collect and further process the time series in order to give useful feedback to the involved users or to provide valuable insight to various internal/external analytical services. The regulation regarding the processing of user-generated data sets~\cite{tankard2016gdpr} requires the provision of privacy guarantees to the users. At the same time, it is essential to provide utility metrics to the final consumers of the privacy-preserving process output. +To accomplish this, various privacy techniques perturb the original data or the processing output at the expense of the overall utility of the final output. A widely recognized tool that introduces probabilistic randomness to the original data, while quantifying with a parameter $\varepsilon$ (`privacy budget'~\cite{mcsherry2009privacy}) the privacy/utility ratio is \emph{$\varepsilon$-differential privacy}~\cite{dwork2006calibrating}. +Due to its \emph{composition} property, i.e.,~the combination of differentially private outputs satisfies differential privacy as well, differential privacy is suitable for privacy-preserving time series publishing. 
\emph{Event}, \emph{user}~\cite{dwork2010differential, dwork2010pan}, and \emph{$w$-event}~\cite{kellaris2014differentially} comprise the possible levels of privacy protection. Event-level limits the privacy protection to \emph{any single event}, user-level protects \emph{all the events} of any user, and $w$-event provides privacy protection to \emph{any sequence of $w$ events}. The privacy mechanisms for the aforementioned levels assume that in a time series any single event, or any sequence of events, or the entire series of events is equally privacy-significant for the users. In reality, this is a simplistic assumption. The significance of an event is related to certain user-defined privacy criteria, or to its adjacent events, as well as to the entire time series. -Identifying \emph{\thething} (significant) events can be done in an automatic or manual way (but is out of scope for this work). +We refer to significant events as \emph{{\thething} events} or simply \emph{\thethings}. +Identifying {\thethings} can be done in an automatic or manual way (but is out of scope for this work). For example, in spatiotemporal data, \emph{places where an individual spent some time} denote \emph{points of interest} (POIs) (also called stay points)~\cite{zheng2015trajectory}. -Such data items, and more particularly their spatial attribute values, can be less privacy-sensitive~\cite{primault2018long}, e.g.,~parks, theaters, etc. or, if individuals frequent them, they can reveal supplementary information, e.g.,~residences (home addresses)~\cite{gambs2010show}, places of worship (religious beliefs)~\cite{franceschi-bicchierairussell2015redditor}, etc. +Such events, and more particularly their spatial attribute values, can be less privacy-sensitive~\cite{primault2018long}, e.g.,~parks, theaters, etc. 
or, if individuals frequent them, they can reveal supplementary information, e.g.,~residences (home addresses)~\cite{gambs2010show}, places of worship (religious beliefs)~\cite{franceschi-bicchierairussell2015redditor}, etc. POIs can be an example of how we can choose {\thethings}, but the idea is not limited to these. +Another example is the detection of privacy-sensitive user interactions by \emph{contact tracing} applications. +This can be practical in disease control~\cite{eames2003contact}, as in the recent outbreak of the Coronavirus disease 2019 (COVID-19) epidemic~\cite{ahmed2020survey}. +Last but not least, {\thethings} in \emph{smart grid} electricity usage patterns could not only reveal the energy consumption of a user but also information regarding activities, e.g.,~`at work', `sleeping', etc. and types of appliances already installed or recently purchased~\cite{khurana2010smart}. -\begin{figure}[htp] - \centering - \includegraphics[width=\linewidth]{st-cont} - \caption{User-level and {\thething} $\varepsilon$-differential privacy protection for the time series of Figure~\ref{fig:lmdk-scenario}.} - \label{fig:st-cont} -\end{figure} +\begin{example} + \label{ex:st-cont} + + Figure~\ref{fig:st-cont} shows the case when we want to protect all of Bob's significant events ($p_1$, $p_3$, $p_5$, $p_8$) in his trajectory shown in Figure~\ref{fig:scenario}. + % That is, we have to allocate privacy budget $\varepsilon$ such that at any timestamp $t$ it holds that $\varepsilon_t + \varepsilon_1 + \varepsilon_3 + \varepsilon_5 + \varepsilon_8 \leq \varepsilon$. + In this scenario, event-level protection is not suitable since it can only protect one event at a time. + Hence, we have to apply user-level privacy protection by distributing equal portions of $\varepsilon$ to all the events, i.e.,~$\frac{\varepsilon}{8}$ to each one (the equivalent of applying $8$-event privacy). 
+ In this way, we have protected the {\thething} points; we have allocated a total of $\frac{\varepsilon}{2}<\varepsilon$ to the {\thethings}. + + \begin{figure}[htp] + \centering + \includegraphics[width=\linewidth]{st-cont} + \caption{User-level and {\thething} $\varepsilon$-differential privacy protection for the time series of Figure~\ref{fig:scenario}.} + \label{fig:st-cont} + \end{figure} + + However, perturbing each regular point by $\frac{\varepsilon}{8}$ deteriorates the data utility unnecessarily. + Notice that the overall privacy budget that we ended up allocating to the user-defined significant events is equal to $\frac{\varepsilon}{2}$ and leaves an equal amount of budget to distribute to any current event. + In other words, uniformly allocating $\frac{\varepsilon}{5}$ to every event would still achieve Bob's privacy goal, i.e.,~protect every significant event, while achieving better utility overall. + +\end{example} We argue that protecting only {\thething} events along with any regular event release is sufficient for the user's protection, while it improves data utility. +Considering {\thething} events can prevent over-perturbing the data, to the benefit of their final quality. Take for example the scenario in Figure~\ref{fig:st-cont}, where {\thethings} are highlighted in gray. -If we want to protect the {\thething} points, we have to allocate at most a budget of $\varepsilon$ to the {\thethings}. -Notice that the more budget we allocate to an event the less we protect it, but at the same time we maintain its utility. -In this scenario, event-level protection is not suitable since it can only protect one event at a time. -Hence, we have to apply user-level privacy protection by distributing equal portions of $\varepsilon$ to all the events, i.e.,~$\frac{\varepsilon}{8}$ to each one (the equivalent of applying $8$-event privacy). 
-In this way, we have protected the {\thething} points; we have allocated a total of $\frac{\varepsilon}{2}<\varepsilon$ to the {\thethings}. -However, perturbing by $\frac{\varepsilon}{8}$ each regular point deteriorates the data utility unnecessarily. +If we want to protect the {\thething} points, we have to allocate at most a budget of $\varepsilon$ to the {\thethings}, while saving some for the release of regular events. +Essentially, the more budget we allocate to an event the less we protect it, but at the same time we maintain its utility. With {\thething} privacy we propose to distribute the budget taking into account only the existence of the {\thethings} when we release an event of the time series, i.e.,~allocating $\frac{\varepsilon}{5}$ ($4\ \text{\thethings} + 1\ \text{regular point}$) to each event (see Figure~\ref{fig:st-cont}). This way, we still guarantee that the {\thethings} are adequately protected, as they receive a total budget of $\frac{4\varepsilon}{5}<\varepsilon$. -At the same time, we avoid over-perturbing the regular events, as we allocate to them a higher total budget ($\frac{4\varepsilon}{5}$) than in user-level ($\frac{\varepsilon}{2}$), and thus less noise. +At the same time, we avoid over-perturbing the regular events, as we allocate to them a higher total budget ($\frac{4\varepsilon}{5}$) than in user-level ($\frac{\varepsilon}{2}$), and thus less noise. diff --git a/text/problem/thething/problem.tex b/text/problem/thething/problem.tex index ecd167c..53fc7b4 100644 --- a/text/problem/thething/problem.tex +++ b/text/problem/thething/problem.tex @@ -1,66 +1,7 @@ -\subsection{{\Thething} privacy} +\subsection{Problem description and definition} \label{subsec:lmdk-prob} -{\Thething} privacy is based on differential privacy. -For this reason, we revisit the definition and important properties of differential privacy before moving on to the main ideas of this paper. 
-Although, its local variant~\cite{duchi2013local} is more compatible with microdata, which is our use case, for the shake of simplicity we stick to the original version of differential privacy. -We refer the interested reader to~\cite{desfontaines2020sok} for a systematic taxonomy of the different variants and extensions of differential privacy, to~\cite{katsomallos2019privacy} for a survey of privacy models for continuous data publishing, and to~\cite{primault2018long} for an organization of the recent contributions in location privacy. - - -\subsubsection{Differential privacy} -\label{subsec:dp} - -\emph{Differential privacy}~\cite{dwork2006calibrating} is a property of a privacy mechanism $\mathcal{M}$ processing a set of \emph{privacy-sensitive} personal data $D$, -%from a domain $\mathcal{D}$, -while providing quantifiable privacy and utility guarantees. -More specifically, $\mathcal{M}$ satisfies $\varepsilon$-differential privacy for a given `privacy budget' $\varepsilon \in \mathbb{R^+}$, if the ratio of the probabilities of $D$ and $D'$ being true worlds is lower or equal to $e^\varepsilon$, where $D'$ differs in one tuple from $D$. -%cannot decide sure that a tuple exists in the database or not. for every pair of data sets $D, D' $ -% \in \mathcal{D}$ -%, as defined in Definition~\ref{def:dp}. 
- - -%\begin{definition} -% [Differential privacy~\cite{dwork2006calibrating}] -% \label{def:dp} -% A privacy mechanism $\mathcal{M}$ with domain $\mathcal{D}$ and range $\mathcal{O}$, satisfies $\varepsilon$-differential privacy, for a given privacy budget $\varepsilon \in \mathbb{R}+$, if for every pair of data sets $D, D' $ -% % \in \mathcal{D}$ -% differing in one tuple and all sets $O$ it holds that: -% %\subseteq \mathcal{O}$: -% $$\Pr[\mathcal{M}(D) \in O] \leq e^\varepsilon \Pr[\mathcal{M}(D') \in O]$$ -%\end{definition} -% - -A widely used privacy mechanism is the \emph{Laplace} -% \kat{add exponential, maybe put references only} -~\cite{dwork2014algorithmic}, which draws randomly a value from the probability distribution of $\textrm{Laplace}(\mu, \frac{\Delta f}{\varepsilon})$. -$\mu$ stands for the original output of the query associated to $\mathcal{M}$ with sensitivity $\Delta f$, and $\frac{\Delta f}{\varepsilon}$ is the scale of the distribution. -% \kat{do we need to know the details of the mechanisms? I would rather put an intuitive description, or in which cases each is preferred.} -% \mk{Most probably we'll need the exponential and Geo-I.} -%A typical example of differential privacy mechanism is the \emph{Laplace mechanism}~\cite{dwork2014algorithmic}. -%It draws randomly a value from the probability distribution of $\textrm{Laplace}(\mu, b)$, where $\mu$ stands for the location parameter and $b > 0$ the scale parameter. -%Here, $\mu$ is equal to the original output value of a query function, and $b$ is the sensitivity of the query function divided by $\varepsilon$. -%The Laplace mechanism works for any function with range the set of real numbers. -A specialization of this mechanism for location data is the \emph{Planar Laplace mechanism}~\cite{andres2013geo}, which is based on a multivariate Laplace distribution and offers a level of protection equal to $\varepsilon$ times a desired protection radius. 
- -Mechanisms that satisfy differential privacy are \emph{composable}. -%, i.e.,~the combination of their results satisfies differential privacy as well. -The \emph{sequential} composition of $\mathcal{M}_1(D)$, $\mathcal{M}_2(D)$ with $\varepsilon_1$, $\varepsilon_2$ -%applied on the same $D$ -results in $\mathcal{M}(D)$ with $\varepsilon = \varepsilon_1 + \varepsilon_2$. -% The \emph{parallel} composition of $\mathcal{M}_1(D_1),\mathcal{M}_2(D_2)$ with $\varepsilon_1,\varepsilon_2$ results in $\mathcal{M}(D_1\cup D_2)$ with $\varepsilon{=}max(\varepsilon_1,\varepsilon_2)$. The post-processing of the results of a differential private mechanism does not deteriorate the entailed privacy. -% \mk{We don't need it now} -% -%\kat{integrate with next subsection} -%The presence of temporal correlations might result into additional privacy loss due to data releases performed in previous -- \emph{backward privacy loss} $\alpha^B$ -- and subsequent --\emph{forward privacy loss} $\alpha^F$~\cite{cao2017quantifying} -- timestamps.\kat{review -- complete further if space} -%Cao et al.~\cite{cao2017quantifying} propose a method for computing the total temporal privacy loss (TPL) in the presence of temporal correlations and background knowledge. Due to the lack of space, we refer the interested reader to the original publication for the complete definitions and formulas. - - -\subsubsection{Problem description and definition} -\label{subsec:prob-set} - -%\kat{move flowchart here} - -%Our problem setting consists of three entities: (i) data generators (users), (ii) data publishers (trusted non-adversarial entities), and (iii) data consumers (possibly adversarial entities). +Our problem setting consists of three entities: (i) data generators (users), (ii) data publishers (trusted non-adversarial entities), and (iii) data consumers (possibly adversarial entities). 
Users generate sensitive data, which are processed in a secure and private way by a trusted curator and are later published in order to be consumed by potentially adversarial data analysts. %The data unit produced by the users is an \emph{event}, i.e., a piece of timestamped user-related information.\kat{should we say geo-stamped?}. Data are produced as a series of events, which we call time series. @@ -68,76 +9,62 @@ An \emph{event} is defined as a triple of an identifying attribute of an individ %This workflow is repeated in a continuous manner, producing series of events, which we call time series. %, producing, processing, publishing, and consuming events in a private manner. %\kat{keep only the terms with a small description.} -%\begin{enumerate}[(i)] -% -% \item \textbf{Data generators} (users) entity $E_g$ interacts with a crowdsensing application and produces continuously privacy-sensitive data items in an arbitrary frequency during the application's usage period $T = (t)_{t \in \mathbb{N}}$. -% Thus, at each timestamp $t$, $E_g$ generates a data set $D_t \in \mathcal{D}$ where each of its members contributes a single data item. -% -% \item \textbf{Data publishers} (trusted non-adversarial) entity $E_p$ receives the data sent by $E_g$ in the form of a series of events in $T$. -% Following the \emph{global} processing and publishing scheme, $E_p$ collects at $t$ a data set $D_t$ and privacy-protects it by applying the respective privacy mechanism $\mathcal{M}_t$. -% $\mathcal{M}_t$ uses independent randomness such that it satisfies $\varepsilon_t$-differential privacy. -% -% \item \textbf{Data consumers} (possibly adversarial) entity $E_c$ receives the result $\mathbf{o}_t$ of the privacy-preserving processing of $D_t$ by $E_p$. 
-% According to Theorem~\ref{theor:compo-seq-ind}, the overall privacy guarantee of the outputs of $\mathcal{M}$ is equal to the sum of all the privacy budgets of the respective privacy mechanisms that compose $\mathcal{M}$, i.e.,~$\sum_{t \in T}\varepsilon_t$. -% -%\end{enumerate} -% -%We assume that all the interactions between $E_g$ and $E_p$ are secure and private, and thus $E_p$ is considered trusted and non-adversarial by $E_g$. -%Notice that, in a real life scenario, $E_g$ and $E_c$ might overlap with each other, i.e.,~data producers might be data consumers as well. -% -% -%\subsection{Privacy goal} -%\label{subsec:prv-g} +\begin{enumerate}[(i)] + + \item \textbf{Data generators} (users) entity $E_g$ interacts with a crowdsensing application and continuously produces privacy-sensitive data items at an arbitrary frequency during the application's usage period $T = (t)_{t \in \mathbb{N}}$. + Thus, at each timestamp $t$, $E_g$ generates a data set $D_t \in \mathcal{D}$ where each of its members contributes a single data item. + + \item \textbf{Data publishers} (trusted non-adversarial) entity $E_p$ receives the data sent by $E_g$ in the form of a series of events in $T$. + Following the \emph{global} processing and publishing scheme, $E_p$ collects at $t$ a data set $D_t$ and privacy-protects it by applying the respective privacy mechanism $\mathcal{M}_t$. + $\mathcal{M}_t$ uses independent randomness such that it satisfies $\varepsilon_t$-differential privacy. + + \item \textbf{Data consumers} (possibly adversarial) entity $E_c$ receives the result $\mathbf{o}_t$ of the privacy-preserving processing of $D_t$ by $E_p$. + According to Theorem~\ref{theor:compo-seq-ind}, the overall privacy guarantee of the outputs of $\mathcal{M}$ is equal to the sum of all the privacy budgets of the respective privacy mechanisms that compose $\mathcal{M}$, i.e.,~$\sum_{t \in T}\varepsilon_t$. 
+ +\end{enumerate} + +We assume that all the interactions between $E_g$ and $E_p$ are secure and private, and thus $E_p$ is considered trusted and non-adversarial by $E_g$. +Notice that, in a real life scenario, $E_g$ and $E_c$ might overlap with each other, i.e.,~data producers might be data consumers as well. + + +\subsubsection{Privacy goal} +\label{subsec:prv-g} We argue that in continuous user-generated data publishing, events are not equally `significant' in terms of privacy. -% (including contextual information). -%It can be seen as a correspondence to a record in a database, where each individual may participate once. -We term a significant event---according to user- or data-related criteria---as a \emph{\thething}~event. -% As mentioned in Section~\ref{sec:intro}, t +% We term a significant event---according to user- or data-related criteria---as a \emph{\thething}~event. The identification of {\thething} events can be performed manually or automatically~\cite{zhou2004discovering, hariharan2004project}, and is an orthogonal problem to this current work. -%We defer the study of the {\thethings} discovery to a following work. In this work, we consider the {\thething} timestamps non-sensitive and provided by the user as input along with the privacy budget $\varepsilon$. -% \kat{check that this is mentioned in the intro} -For example, events $p_1$, $p_3$, $p_5$, $p_8$ in Figure~\ref{fig:lmdk-scenario} are {\thething} events. -% relevant to certain user-defined privacy criteria, or to its adjacent data item(s) as well as to the entire data set or parts thereof. +For example, events $p_1$, $p_3$, $p_5$, $p_8$ in Figure~\ref{fig:scenario} are {\thething} events. +We give the definition of {\thethings} below (Definition~\ref{def:thething-evnt}). % A significant event or item signals its consequence to us, toward us. 
% https://www.quora.com/What-is-the-difference-between-significant-and-important \begin{definition} - [{\Thething} event] - \label{def:thething-evnt} - A {\thething} event is a significant---according to user- or data-related criteria---user-generated data item. + [{\Thething} event] + \label{def:thething-evnt} + A {\thething} event is a significant---according to user- or data-related criteria---user-generated data item. \end{definition} +Definition~\ref{def:thething-nb} extends the notion of neighboring data sets to the context of {\thethings}. +\begin{definition} + [{\Thething} neighboring time series] + \label{def:thething-nb} + Two time series of equal lengths are \emph{{\thething} neighboring} when they differ by a single {\thething} event. +\end{definition} -%In this scenario, these are $1$, $3$, $5$, $8$ and they fall in areas in a dark shade. -%\kat{we must define a series of events before neighbouring series of events} - -%\begin{definition} -% [{\Thething} neighboring time series] -% \label{def:thething-nb} -% Two time series are {\thething} neighboring (or adjacent) when they differ by a single {\thething} event. -%% i.e.,~one can be obtained by adding/removing a {\thething} to/from the other. -%\end{definition} - -Two time series of equal lengths are \emph{{\thething} neighboring} when they differ by a single {\thething} event. -For example, the time series ($p_1$, \dots, $p_8$) with {\thethings} set the $\{p_1, p_3,p_5\}$ is {\thething} neighboring to the time series of Figure~\ref{fig:lmdk-scenario}. -%This means that we can obtain the first time series by adding/removing one event to/from the second time series. -%to/from any one of two {\thething} neighboring series of events we can obtain the other series. +For example, the time series ($p_1$, \dots, $p_8$) with the {\thethings} set \{$p_1$, $p_3$, $p_5$\} is {\thething} neighboring to the time series of Figure~\ref{fig:scenario}. Therefore, Corollary~\ref{cor:thething-nb} follows. 
\begin{corollary} \label{cor:thething-nb} Two {\thething} neighboring time series are event neighboring as well. \end{corollary} -%\kat{what is event neighboring?} We proceed to propose \emph{{\thething} privacy}, a configurable variation of differential privacy for time series (Definition~\ref{def:thething-prv}). -%\kat{Up to now M was a mechanism, now it is a set of mechanisms?} \begin{definition} - % [{\Thething} privacy] + [{\Thething} privacy] \label{def:thething-prv} Let $\mathcal{M}$ be a privacy mechanism with range $\mathcal{O}$ that takes as input a time series. We say that $\mathcal{M}$ satisfies {\thething} $\varepsilon$-differential privacy (or, simply, {\thething} privacy) if for all sets of possible outputs $O \subseteq \mathcal{O}$, and for every pair of {\thething}-neighboring time series $S_T$, $S_T'$, @@ -146,31 +73,18 @@ We proceed to propose \emph{{\thething} privacy}, a configurable variation of di $$Pr[\mathcal{M}(S_T) \in O] \leq e^\varepsilon Pr[\mathcal{M}(S_T') \in O]$$ \end{definition} -% \kat{to rephrase for an easier transition -- mention here user and event level that satisfy {\thething} privacy and add discussion that we can do better and propose the new mechanism} -As discussed in Section~\ref{subsec:prv-levels}, user-level privacy can achieve {\thething} privacy, but it over-perturbs the final data by not distinguishing into {\thething} and regular events. +User-level privacy can achieve {\thething} privacy, but it over-perturbs the final data by not distinguishing between {\thething} and regular events. Theorem~\ref{theor:thething-prv} proposes how to achieve the desired privacy for the {\thethings} (i.e.,~a total budget lower than $\varepsilon$), and at the same time provide better quality overall. - -% the existing protection levels of differential privacy do not provide adequate control in time series publishing. 
-%In Figure~\ref{fig:st-cont} we exhibited how additional user preferences can impact the necessary privacy/utility tradeoff. -%We introduce the notion of {\thethings} and propose {\thething} privacy, a configurable variation of differential privacy for time series. -%By taking into account {\thethings}, i.e.,~timestamps were significant events take place, {\thething} privacy can provide a satisfying protection level while not perturbing the original time series unnecessarily. - -%In mereology, the formal study on the relation between parts and the entities they form, it is generally held that the identity of an observable object depends on its \emph{spatiotemporal continuity}~\cite{wiggins1967identity, scaltsas1981identity, hazarika2001qualitative}: the property of well-behaved objects that alter their state in harmony with space and time. -%Considering events that span the entirety of the user-generated time series ensures the spatiotemporal continuity of the users. -%This way, it is possible to acquire more information regarding individuals' identities, and design privacy preserving methods that offer improved privacy and utility guarantees. \begin{theorem} - % [{\Thething} privacy] + [{\Thething} privacy] \label{theor:thething-prv} - % A privacy mechanism that protects any timestamp all the {\thething} events in a time series, satisfies {\thething} privacy. Let $\mathcal{M}$ be a mechanism with input a time series $S_T$, where $T$ is the set of the involved timestamps, and $L \subseteq T$ be the set of {\thething} timestamps. $\mathcal{M}$ is decomposed into $\varepsilon_t$-differentially private sub-mechanisms $\mathcal{M}_t$, for every $t \in T$, that apply independent randomness to the data item at $t$. 
Then, given a privacy budget $\varepsilon$, $\mathcal{M}$ satisfies {\thething} privacy if for every $t$ it holds that $$ \sum_{i\in L \cup \{t\}} \varepsilon_i \leq \varepsilon$$ \end{theorem} -% \mk{To discuss.} -% Due to space constraints, we omit the proof of Theorem~\ref{theor:thething-prv} and defer it for a longer version of this paper. \begin{proof} \label{pf:thething-prv} All mechanisms use independent randomness, and therefore for a time series $S_T = \{D_1, \dots, D_T\}$ and outputs $(\pmb{o}_1, \dots, \pmb{o}_T) \in O \subseteq \mathcal{O}$ it holds that @@ -196,169 +110,3 @@ Theorem~\ref{theor:thething-prv} proposes how to achieve the desired privacy for If the formula of Theorem~\ref{theor:thething-prv} holds, then we get $\frac{Pr[\mathcal{M}(S_T) \in O]}{Pr[\mathcal{M}(S'_T) \in O]} \leq e^\varepsilon$. Due to Definition~\ref{def:thething-prv} this concludes our proof. \end{proof} - - -\subsubsection{{\Thething} privacy mechanisms} -\label{subsec:lmdk-mechs} -% \kat{add the two models -- uniform and dynamic and skip} - -%\kat{isn't the uniform distribution a method? there is a section for the methods. } -Figure~\ref{fig:st-cont} shows the simplest model that implements Theorem~\ref{theor:thething-prv}, the \textbf{Uniform} distribution of privacy budget $\varepsilon$ for {\thething} privacy. -% \mk{We capitalize the first letter because it's the name of the method.} -% in comparison with user-level protection. -In this case, it is enough to distribute at each timestamp the total privacy budget divided by the number of timestamps corresponding to {\thethings}, plus one if we are releasing a regular timestamp. -Consequently, at each timestamp we protect every {\thething}, while reserving a part of $\varepsilon$ for the current timestamp. -%In this case, distributing $\frac{\varepsilon}{5}$ can guarantee {\thething} privacy. 
- - - -% \begin{figure}[htp] -% \centering -% \includegraphics[width=0.9\linewidth]{thething-prv} -% \caption{Uniform application scenario of {\thething} privacy.} -% \label{fig:thething-prv} -% \end{figure} - -Next, we propose an \textbf{Adaptive} privacy mechanism taking into account changes in the input data and exploiting the post-processing property of differential privacy. -Initially, it reserves uniformly the available privacy budget for each future release. -At each timestamp, based on a sampling rate the mechanism either publishes with noise the original data or it releases an approximation based on previous releases. -In the case when it publishes with noise the original data, it also calculates the difference between the current and the previous release and compares the difference with the scale of the perturbation ($\frac{\Delta f}{\varepsilon}$). -The outcome of this comparison determines the adaptation of the sampling rate for the next events: -if the scale is greater it means that the input has not changed much, and therefore it must decrease the sampling rate. -In the case when the mechanism approximates a {\thething} (but not a regular timestamp), it distributes the reserved privacy budget -% divided by the number of remaining {\thething} plus one -to the next timestamps. - -% Why skipping publications is problematic? -One might argue that we could \textbf{Skip} the \thething\ data releases. -% and limit the number of {\thethings}. -This would result in preserving all of the available privacy budget for regular events (because the set $L \cup \{t\}$ becomes $\{t\}$), equivalently to event-level protection. -In practice, however, this approach can eventually pose arbitrary privacy risks, especially when dealing with geotagged data. 
-Particularly, sporadic location data publishing~\cite{gambs2010show, russell2018fitness} or misapplying location cloaking~\cite{xssfopes2020tweet} could result in areas with sparse data points, indicating privacy-sensitive locations. - -% \mk{WIP} -% \kat{write in text and remove the algorithm} -% \begin{algorithm} -% \caption{Adaptive {\thething} privacy mechanism} -% \label{algo:adapt-lmdk-priv} - -% \SetKwInput{KwData}{Input} -% \SetKwInput{KwResult}{Output} - -% \SetKwData{diffCur}{diffCur} -% \SetKwData{diffMin}{diffMin} -% \SetKwData{evalCur}{evalCur} -% \SetKwData{evalOrig}{evalOrig} -% \SetKwData{evalSum}{evalSum} -% \SetKwData{metricCur}{metricCur} -% \SetKwData{metricOrig}{metricOrig} -% \SetKwData{opt}{opt} -% \SetKwData{opti}{opt$_i$} -% \SetKwData{optim}{optim} -% \SetKwData{optimi}{optim$_i$} -% \SetKwData{opts}{opts} -% \SetKwData{reg}{reg} - -% \SetKwData{S}{$S_T$} -% \SetKwData{L}{$L$} -% \SetKwData{epsilon}{$\varepsilon$} - -% \SetKwFunction{calcMetric}{calcMetric} -% \SetKwFunction{evalSeq}{evalSeq} -% \SetKwFunction{getCombs}{getCombs} -% \SetKwFunction{getOpts}{getOpts} - -% \DontPrintSemicolon - -% \KwData{\S, \L, \epsilon} -% \KwResult{\optim} -% \BlankLine - -% % \If{abs($$)} - -% % \If{$i \in L$}{ -% % \lmdks $\leftarrow$ \lmdks + 1 -% % \ForEach{$j \in [i + 1, T]$}{ -% % $varepsilon_j \leftarrow varepsilon_j + \frac{\varepsilon_i}{|T| - \lmdks + 1}$ -% % } -% % } - -% % Evaluate the original -% \metricOrig $\leftarrow$ \calcMetric{$\{t_n\}, \emptyset, \{l_k\}$}\; -% \evalOrig $\leftarrow$ \evalSeq{\metricOrig}\; - -% % Get all possible option combinations -% \opts $\leftarrow$ \getOpts{$\{t_n\}, \{l_k\}$}\; - -% % Track the minimum (best) evaluation -% \diffMin $\leftarrow$ $\infty$\; - -% % Track the optimal sequence (the one with the best evaluation) -% \optim $\leftarrow$ $[]$\; - -% \ForEach{\opt $\in$ \opts}{\label{algo:lmdk-sel-opt-for-each} -% \evalSum $\leftarrow 0$\; -% \ForEach{\opti $\in$ \opt}{ -% \metricCur $\leftarrow$ 
\calcMetric{$\{t_n\}, \opti, \{l_k\}$}\;\label{algo:lmdk-sel-opt-comparison} -% \evalSum $\leftarrow$ \evalSum $+$ \evalSeq{\metricCur}\; - -% % Compare with current optimal -% \diffCur $\leftarrow \left|\evalSum/\#\opt - \evalOrig\right|$\; -% \If{\diffCur $<$ \diffMin}{ -% \diffMin $\leftarrow$ \diffCur\; -% \optim $\leftarrow$ \opt\; -% } -% } -% }\label{algo:lmdk-sel-opt-end} -% \Return{\optim} -% \end{algorithm} - - -\subsubsection{{\Thething} privacy under temporal correlation} -\label{subsec:correlations} -From the discussion so far, it is evident that for the budget distribution it is not the positions but rather the number of the {\thethings} that matters. -However, this is not the case in the presence of temporal correlation, which is inherent in continuously generated data. - - -% HMMs have two important independence properties: -% Markov hidden process: future depends on past via the present. -% Current observation independent of all else given current state. -% Intuitively, D^t or D^{t+1} "cuts off" the propagation of the Markov chain. -The Hidden Markov Model~\cite{baum1966statistical} stipulates two important independence properties: (i)~the future (past) depends on the past (future) via the present, and (ii)~the current observation is independent of the rest given the current state. -%Thus, the observation of a data release at a timestamp $t$ depends only on the respective input data set $D_t$, i.e.,~the current state. -Hence, an observation at a specific timestamp is independent of the previous and next data sets given the current input data set. -Intuitively, knowing the data set at timestamp $t$ stops the propagation of the Markov chain towards the next or previous timestamps. -%\kat{do we see this in the formula 1 ?} -%when calculating the forward or backward privacy loss respectively.
- -Cao et al.~\cite{cao2017quantifying} propose a method for computing the total temporal privacy loss $\alpha_t$ at a timestamp $t$ as the sum of the backward and forward privacy loss, $\alpha^B_t$ and $\alpha^F_t$, minus the privacy budget $\varepsilon_t$ -to account for the extra privacy loss due to previous and next releases $\pmb{o}$ of $\mathcal{M}$ under temporal correlation. -By Theorem~\ref{theor:thething-prv}, at every timestamp $t$ we consider the data at $t$ and at the {\thething} timestamps $L$. -%According to the Definitions~{\ref{def:bpl} and \ref{def:fpl}}, we calculate the backward and forward privacy loss by taking into account the privacy budget at previous and next data releases respectively. -When sequentially composing the data releases for each timestamp $i$ in $L \cup \{t\}$ we -%calculate the temporal privacy loss $\alpha_t$ at each timestamp $t \in L \cup \{i\}$ by -%consider the previous and next data releases at the timestamps $i^{-}, i^{+} \in L \cup \{t\} \setminus \{i\}$ respectively. -consider the previous releases in the whole time series until the timestamp $i^{-}$ that is exactly before $i$ in the ordered $L {\cup} \{t\}$, and the next data releases in the whole time series until the timestamp $ i^{+}$ that is exactly after $i$ in the ordered $L {\cup }\{t\} $. -%\kat{not sure I understand} -%Thus, we calculate the backward/forward privacy loss by taking into account the data releases after/before the previous/next data item. -That is: -% \dk{do we keep looking at all Landmarks both for backward and forward? I would assume that for backward we are looking to the Landmarks until the i and for the forward to the Landmarks after the i - if we would like to be consistent with Cao. Otherwise the writing here is confusing.} -% \mk{We are discussing about the case where we calculate the tpl at each timestamp i in L+{t}. 
Therefore, bpl at i is calculated until i- and fpl at i until i+.} - -\begin{align} - \adjustbox{max width=0.9\linewidth}{ - $\alpha_i = - \underbrace{\ln \frac{\Pr[\pmb{o}_{i^{-} + 1}, \dots, \pmb{o}_i | D_i]}{\Pr[\pmb{o}_{i^{-} + 1}, \dots, \pmb{o}_i | D'_i]}}_{\alpha^B_i} + - \underbrace{\ln \frac{\Pr[\pmb{o}_{i^{+} - 1}, \dots, \pmb{o}_i | D_i]}{\Pr[\pmb{o}_{i^{+} - 1}, \dots, \pmb{o}_i | D'_i]}}_{\alpha^F_i} - - \underbrace{\ln \frac{\Pr[\pmb{o}_i | D_i]}{\Pr[\pmb{o}_i | D'_i]}}_{\varepsilon_i}$ - } -\end{align} - -Finally, $\alpha_t$ is equal to the sum of all $\alpha_i , i\in L \cup\{t\}$. - -% -% where $x_t$ (or $x'_t$) is the potential (neighboring) data item of an individual who is targeted by an adversary with knowledge $\mathbb{D}_t$. -%where $D_t$ and $D'_t$ are the neighboring input data sets (Definition~\ref{def:nb-d-s}) responsible for the output $\pmb{o}_t$. -%Notice that if $t$ is the first or last item in $L \cup \{i\}$ then we need to set $t_{\text{prv}} = 0$ or $t_{\text{nxt}} = \max(T) + 1$. - -%In Section~\ref{sec:eval}, we experimentally show how the distribution of {\thethings} impacts the overall privacy loss of the user.
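The final summation over $L \cup \{t\}$ can be sketched in Python as follows. The sketch assumes that the backward and forward privacy losses $\alpha^B_i$ and $\alpha^F_i$ have already been computed by some other routine (for instance, following Cao et al.), which is not implemented here; the function name \texttt{temporal\_privacy\_loss} and the dictionary-based inputs are hypothetical.

```python
from typing import Dict, Set


def temporal_privacy_loss(backward: Dict[int, float], forward: Dict[int, float],
                          budgets: Dict[int, float], landmarks: Set[int],
                          t: int) -> float:
    """Sum alpha_i = alpha^B_i + alpha^F_i - eps_i over all i in L ∪ {t}.

    Assumption: `backward` and `forward` hold precomputed losses per
    timestamp; this helper only performs the final aggregation.
    """
    relevant = set(landmarks) | {t}
    return sum(backward[i] + forward[i] - budgets[i] for i in relevant)


# Illustrative values only: without temporal correlation, the backward and
# forward losses at each timestamp equal the budget spent there.
budgets = {1: 0.2, 2: 0.2, 3: 0.2}
loss = temporal_privacy_loss(backward=dict(budgets), forward=dict(budgets),
                             budgets=budgets, landmarks={2}, t=3)
```

In this uncorrelated case each $\alpha_i$ reduces to $\varepsilon_i$, so the total equals the budget spent over $L \cup \{t\}$; under correlation the backward and forward terms would be larger.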