\chapter{Significant events}
\label{ch:the-thing}

In this chapter, we propose a novel configurable privacy scheme, \emph{\thething} privacy, which takes into account significant events (\emph{\thethings}) in the time series and allocates the available privacy budget accordingly.
We propose two privacy models that guarantee {\thething} privacy and validate our proposal on real and synthetic data sets.
\section{Motivation}
\label{sec:lmdk-motiv}

The plethora of sensors currently embedded in or paired with personal devices and other infrastructures has paved the way for the development of numerous \emph{crowdsensing services} (e.g.,~Google Maps~\cite{gmaps}, Waze~\cite{waze}, etc.) based on the collected personal data, which are usually geotagged and timestamped.

User--service interactions gather personal event-like data, e.g.,~(\emph{`Bob', `dining', `Canal Saint-Martin', $17{:}00$}).
When the interactions are performed in a continuous manner, we obtain \emph{time series} of events.
An \emph{event} represents a user--service interaction, registering the information of the individual at a specific time point, i.e.,~a data item that is a pair of an identifying attribute of an individual and the---possibly sensitive---information at a timestamp (including contextual information).
It corresponds to a record in a database, where each individual may participate once, e.g.,~(`Bob', `dining', `Canal Saint-Martin', $5$).
Typically, users interact with the services more than once, generating data in a continuous manner (\emph{time series}).
The services collect and further process the time series in order to give useful feedback to the involved users or to provide valuable insight to various internal/external analytical services.
Depending on its span, we distinguish the processing into \emph{finite}, when taking place during a predefined time interval, and \emph{infinite}, when taking place in an uninterrupted fashion.
% Figure~\ref{fig:scenario} shows an example of a finite time series produced by a user (Bob) and composed by $8$ timestamps during his trajectory from his home (\'Elys\'ee) to his work (Louvre) to his hangout (Saint-Martin) and back to his home.
\begin{example}
  \label{ex:lmdk-scenario}

  Consider a finite sequence of spatiotemporal data generated by Bob during an interval of $8$ timestamps, as shown in Figure~\ref{fig:lmdk-scenario}.
  These data are the result of user--LBS (location-based service) interaction while retrieving location-based information or reporting user-state at various locations.
  Shaded events correspond to privacy-sensitive events that Bob has defined beforehand. For instance, his home is around {\'E}lys{\'e}e, his workplace is around the Louvre, and his hangout is around Canal Saint-Martin.

  \begin{figure}[htp]
    \centering
    \includegraphics[width=\linewidth]{lmdk-scenario}
    \caption{A time series with {\thethings} (highlighted in gray).}
    \label{fig:lmdk-scenario}
  \end{figure}

\end{example}
The regulation regarding the processing of user-generated data sets~\cite{tankard2016gdpr} requires the provision of privacy guarantees to the users.
At the same time, it is essential to provide utility metrics to the final consumers of the privacy-preserving process output.
A widely recognized tool that introduces probabilistic randomness to the original data, while quantifying the privacy/utility ratio with a parameter $\varepsilon$ (`privacy budget'~\cite{mcsherry2009privacy}), is \emph{$\varepsilon$-differential privacy}~\cite{dwork2006calibrating}.
\emph{Event}, \emph{user}~\cite{dwork2010differential, dwork2010pan}, and \emph{$w$-event}~\cite{kellaris2014differentially} comprise the possible levels of privacy protection.
Event-level limits the privacy protection to \emph{any single event}, user-level protects \emph{all the events} of any user, and $w$-event provides privacy protection to \emph{any sequence of $w$ events}.

The privacy mechanisms for the aforementioned levels assume that in a time series any single event, or any sequence of events, or the entire series of events is equally privacy-significant for the users.
In reality, this is a simplistic assumption.
The significance of an event is related to certain user-defined privacy criteria, or to its adjacent events, as well as to the entire time series.
Identifying \emph{\thething} (significant) events can be done in an automatic or manual way (but is out of scope for this work).
For example, in spatiotemporal data, \emph{places where an individual spent some time} denote \emph{points of interest} (POIs) (also called stay points)~\cite{zheng2015trajectory}.
Such data items, and more particularly their spatial attribute values, can be less privacy-sensitive~\cite{primault2018long}, e.g.,~parks, theaters, etc. or, if individuals frequent them, they can reveal supplementary information, e.g.,~residences (home addresses)~\cite{gambs2010show}, places of worship (religious beliefs)~\cite{franceschi-bicchierairussell2015redditor}, etc.
POIs can be an example of how we can choose {\thethings}, but the idea is not limited to these.
\begin{figure}[htp]
  \centering
  \includegraphics[width=\linewidth]{st-cont}
  \caption{User-level and {\thething} $\varepsilon$-differential privacy protection for the time series of Figure~\ref{fig:lmdk-scenario}.}
  \label{fig:st-cont}
\end{figure}

We argue that protecting only the {\thething} events, along with any regular event being released, is sufficient for the user's protection and improves data utility.
Take for example the scenario in Figure~\ref{fig:st-cont}, where {\thethings} are highlighted in gray.
If we want to protect the {\thething} points, we have to allocate to them a total budget of at most $\varepsilon$.
Notice that the more budget we allocate to an event the less we protect it, but at the same time we maintain its utility.
In this scenario, event-level protection is not suitable since it can only protect one event at a time.
Hence, we have to apply user-level privacy protection by distributing equal portions of $\varepsilon$ to all the events, i.e.,~$\frac{\varepsilon}{8}$ to each one (the equivalent of applying $8$-event privacy).
In this way, we have protected the {\thething} points; we have allocated a total of $\frac{\varepsilon}{2}<\varepsilon$ to the {\thethings}.
However, perturbing each regular point by $\frac{\varepsilon}{8}$ deteriorates the data utility unnecessarily.
With {\thething} privacy we propose to distribute the budget taking into account only the existence of the {\thethings} when we release an event of the time series, i.e.,~allocating $\frac{\varepsilon}{5}$ ($4\ \text{\thethings} + 1\ \text{regular point}$) to each event (see Figure~\ref{fig:st-cont}).
This way, we still guarantee that the {\thethings} are adequately protected, as they receive a total budget of $\frac{4\varepsilon}{5}<\varepsilon$.
At the same time, we avoid over-perturbing the regular events, as we allocate to them a higher total budget ($\frac{4\varepsilon}{5}$) than in user-level ($\frac{\varepsilon}{2}$), and thus less noise.
\section{Contribution}
\label{sec:lmdk-contrib}

In this chapter, we formally define a novel privacy notion that we call \emph{{\thething} privacy}.
We apply this privacy notion to time series consisting of \emph{{\thethings}} and regular events, and we design and implement three {\thething} privacy mechanisms.
We further study {\thething} privacy under temporal correlation, which is inherent in time series publishing.
Finally, we evaluate {\thething} privacy with real and synthetic data sets, in settings with or without temporal correlation, showcasing the validity of our model.
\section{{\Thething} privacy}
\label{sec:prob}

{\Thething} privacy is based on differential privacy.
For this reason, we revisit the definition and important properties of differential privacy before moving on to the main ideas of this paper.
Although its local variant~\cite{duchi2013local} is more compatible with microdata, which is our use case, for the sake of simplicity we stick to the original version of differential privacy.
We refer the interested reader to~\cite{desfontaines2020sok} for a systematic taxonomy of the different variants and extensions of differential privacy, to~\cite{katsomallos2019privacy} for a survey of privacy models for continuous data publishing, and to~\cite{primault2018long} for an organization of the recent contributions in location privacy.
\subsection{Differential privacy}
\label{subsec:dp}

\emph{Differential privacy}~\cite{dwork2006calibrating} is a property of a privacy mechanism $\mathcal{M}$ processing a set of \emph{privacy-sensitive} personal data $D$,
%from a domain $\mathcal{D}$,
while providing quantifiable privacy and utility guarantees.
More specifically, $\mathcal{M}$ satisfies $\varepsilon$-differential privacy for a given `privacy budget' $\varepsilon \in \mathbb{R}^+$, if the ratio of the probabilities of $D$ and $D'$ being true worlds is lower than or equal to $e^\varepsilon$, where $D'$ differs in one tuple from $D$.
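Formally, following~\cite{dwork2006calibrating}, $\mathcal{M}$ satisfies $\varepsilon$-differential privacy if for every pair of data sets $D$, $D'$ differing in one tuple, and for every set $O$ of possible outputs, it holds that
$$\Pr[\mathcal{M}(D) \in O] \leq e^\varepsilon \Pr[\mathcal{M}(D') \in O]$$
so that a smaller $\varepsilon$ forces the two output distributions closer together and, thus, yields stronger protection.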
%cannot decide sure that a tuple exists in the database or not. for every pair of data sets $D, D' $
% \in \mathcal{D}$
%, as defined in Definition~\ref{def:dp}.

%\begin{definition}
% [Differential privacy~\cite{dwork2006calibrating}]
% \label{def:dp}
% A privacy mechanism $\mathcal{M}$ with domain $\mathcal{D}$ and range $\mathcal{O}$, satisfies $\varepsilon$-differential privacy, for a given privacy budget $\varepsilon \in \mathbb{R}+$, if for every pair of data sets $D, D' $
% % \in \mathcal{D}$
% differing in one tuple and all sets $O$ it holds that:
% %\subseteq \mathcal{O}$:
% $$\Pr[\mathcal{M}(D) \in O] \leq e^\varepsilon \Pr[\mathcal{M}(D') \in O]$$
%\end{definition}
%

A widely used privacy mechanism is the \emph{Laplace mechanism}~\cite{dwork2014algorithmic},
% \kat{add exponential, maybe put references only}
which randomly draws a value from the probability distribution of $\textrm{Laplace}(\mu, \frac{\Delta f}{\varepsilon})$.
Here, $\mu$ stands for the original output of the query associated with $\mathcal{M}$, which has sensitivity $\Delta f$, and $\frac{\Delta f}{\varepsilon}$ is the scale of the distribution.
% \kat{do we need to know the details of the mechanisms? I would rather put an intuitive description, or in which cases each is preferred.}
% \mk{Most probably we'll need the exponential and Geo-I.}
%A typical example of differential privacy mechanism is the \emph{Laplace mechanism}~\cite{dwork2014algorithmic}.
%It draws randomly a value from the probability distribution of $\textrm{Laplace}(\mu, b)$, where $\mu$ stands for the location parameter and $b > 0$ the scale parameter.
%Here, $\mu$ is equal to the original output value of a query function, and $b$ is the sensitivity of the query function divided by $\varepsilon$.
%The Laplace mechanism works for any function with range the set of real numbers.
A specialization of this mechanism for location data is the \emph{Planar Laplace mechanism}~\cite{andres2013geo}, which is based on a multivariate Laplace distribution and offers a level of protection equal to $\varepsilon$ times a desired protection radius.

%
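\begin{example}
  As a rough illustration of how the budget drives the noise of the Laplace mechanism, recall that the Laplace distribution with location $\mu$ and scale $b = \frac{\Delta f}{\varepsilon}$ has density and variance
  $$f(x \mid \mu, b) = \frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right), \qquad \mathrm{Var}[x] = 2b^2 .$$
  Assume, purely for the sake of the example, a counting query with $\Delta f = 1$ and a total budget $\varepsilon = 1$ (both values are hypothetical and are not taken from our experiments).
  Releasing with the whole budget gives $b = 1$ and variance $2$, whereas releasing with a share of $\frac{\varepsilon}{5}$ or $\frac{\varepsilon}{8}$ gives $b = 5$ (variance $50$) or $b = 8$ (variance $128$), respectively.
\end{example}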
Mechanisms that satisfy differential privacy are \emph{composable}.
%, i.e.,~the combination of their results satisfies differential privacy as well.
The \emph{sequential} composition of $\mathcal{M}_1(D)$, $\mathcal{M}_2(D)$ with $\varepsilon_1$, $\varepsilon_2$
%applied on the same $D$
results in $\mathcal{M}(D)$ with $\varepsilon = \varepsilon_1 + \varepsilon_2$.
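A short sketch of why the budgets add up, under the simplifying assumptions that the outputs are discrete and that $\mathcal{M}_1$ and $\mathcal{M}_2$ use independent randomness: for any pair of outputs $o_1$, $o_2$ and any data sets $D$, $D'$ differing in one tuple,
\begin{align*}
  \Pr[\mathcal{M}_1(D) = o_1, \mathcal{M}_2(D) = o_2]
  &= \Pr[\mathcal{M}_1(D) = o_1] \, \Pr[\mathcal{M}_2(D) = o_2] \\
  &\leq e^{\varepsilon_1} \Pr[\mathcal{M}_1(D') = o_1] \; e^{\varepsilon_2} \Pr[\mathcal{M}_2(D') = o_2] \\
  &= e^{\varepsilon_1 + \varepsilon_2} \Pr[\mathcal{M}_1(D') = o_1, \mathcal{M}_2(D') = o_2] .
\end{align*}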
% The \emph{parallel} composition of $\mathcal{M}_1(D_1),\mathcal{M}_2(D_2)$ with $\varepsilon_1,\varepsilon_2$ results in $\mathcal{M}(D_1\cup D_2)$ with $\varepsilon{=}max(\varepsilon_1,\varepsilon_2)$. The post-processing of the results of a differential private mechanism does not deteriorate the entailed privacy.
% \mk{We don't need it now}
%
%\kat{integrate with next subsection}
%The presence of temporal correlations might result into additional privacy loss due to data releases performed in previous -- \emph{backward privacy loss} $\alpha^B$ -- and subsequent --\emph{forward privacy loss} $\alpha^F$~\cite{cao2017quantifying} -- timestamps.\kat{review -- complete further if space}
%Cao et al.~\cite{cao2017quantifying} propose a method for computing the total temporal privacy loss (TPL) in the presence of temporal correlations and background knowledge. Due to the lack of space, we refer the interested reader to the original publication for the complete definitions and formulas.
\subsection{Problem description and definition}
\label{subsec:prob-set}

%\kat{move flowchart here}

%Our problem setting consists of three entities: (i) data generators (users), (ii) data publishers (trusted non-adversarial entities), and (iii) data consumers (possibly adversarial entities).
Users generate sensitive data, which are processed in a secure and private way by a trusted curator and are later published in order to be consumed by potentially adversarial data analysts.
%The data unit produced by the users is an \emph{event}, i.e., a piece of timestamped user-related information.\kat{should we say geo-stamped?}.
Data are produced as a series of events, which we call time series.
An \emph{event} is defined as a triple of an identifying attribute of an individual, the possibly sensitive data, and a timestamp.
%This workflow is repeated in a continuous manner, producing series of events, which we call time series.
%, producing, processing, publishing, and consuming events in a private manner.
%\kat{keep only the terms with a small description.}
%\begin{enumerate}[(i)]
%
% \item \textbf{Data generators} (users) entity $E_g$ interacts with a crowdsensing application and produces continuously privacy-sensitive data items in an arbitrary frequency during the application's usage period $T = (t)_{t \in \mathbb{N}}$.
% Thus, at each timestamp $t$, $E_g$ generates a data set $D_t \in \mathcal{D}$ where each of its members contributes a single data item.
%
% \item \textbf{Data publishers} (trusted non-adversarial) entity $E_p$ receives the data sent by $E_g$ in the form of a series of events in $T$.
% Following the \emph{global} processing and publishing scheme, $E_p$ collects at $t$ a data set $D_t$ and privacy-protects it by applying the respective privacy mechanism $\mathcal{M}_t$.
% $\mathcal{M}_t$ uses independent randomness such that it satisfies $\varepsilon_t$-differential privacy.
%
% \item \textbf{Data consumers} (possibly adversarial) entity $E_c$ receives the result $\mathbf{o}_t$ of the privacy-preserving processing of $D_t$ by $E_p$.
% According to Theorem~\ref{theor:compo-seq-ind}, the overall privacy guarantee of the outputs of $\mathcal{M}$ is equal to the sum of all the privacy budgets of the respective privacy mechanisms that compose $\mathcal{M}$, i.e.,~$\sum_{t \in T}\varepsilon_t$.
%
%\end{enumerate}
%
%We assume that all the interactions between $E_g$ and $E_p$ are secure and private, and thus $E_p$ is considered trusted and non-adversarial by $E_g$.
%Notice that, in a real life scenario, $E_g$ and $E_c$ might overlap with each other, i.e.,~data producers might be data consumers as well.
%
%
%\subsection{Privacy goal}
%\label{subsec:prv-g}

We argue that in continuous user-generated data publishing, events are not equally `significant' in terms of privacy.
% (including contextual information).
%It can be seen as a correspondence to a record in a database, where each individual may participate once.
We term a significant event---according to user- or data-related criteria---as a \emph{\thething}~event.
% As mentioned in Section~\ref{sec:intro}, t
The identification of {\thething} events can be performed manually or automatically~\cite{zhou2004discovering, hariharan2004project}, and is orthogonal to the current work.
%We defer the study of the {\thethings} discovery to a following work.
In this work, we consider the {\thething} timestamps non-sensitive and provided by the user as input along with the privacy budget $\varepsilon$.
% \kat{check that this is mentioned in the intro}
For example, events $p_1$, $p_3$, $p_5$, $p_8$ in Figure~\ref{fig:lmdk-scenario} are {\thething} events.
% relevant to certain user-defined privacy criteria, or to its adjacent data item(s) as well as to the entire data set or parts thereof.
% A significant event or item signals its consequence to us, toward us.
% https://www.quora.com/What-is-the-difference-between-significant-and-important
%\begin{definition}
% [{\Thething} event]
% \label{def:thething-evnt}
% A {\thething} event is a significant---according to user- or data-related criteria---user-generated data item.
%\end{definition}

%In this scenario, these are $1$, $3$, $5$, $8$ and they fall in areas in a dark shade.
%\kat{we must define a series of events before neighbouring series of events}

%\begin{definition}
% [{\Thething} neighboring time series]
% \label{def:thething-nb}
% Two time series are {\thething} neighboring (or adjacent) when they differ by a single {\thething} event.
%% i.e.,~one can be obtained by adding/removing a {\thething} to/from the other.
%\end{definition}

Two time series of equal lengths are \emph{{\thething} neighboring} when they differ by a single {\thething} event.
For example, the time series ($p_1$, \dots, $p_8$) with the {\thething} set $\{p_1, p_3, p_5\}$ is {\thething} neighboring to the time series of Figure~\ref{fig:lmdk-scenario}.
%This means that we can obtain the first time series by adding/removing one event to/from the second time series.
%to/from any one of two {\thething} neighboring series of events we can obtain the other series.
% Therefore, Corollary~\ref{cor:thething-nb} follows.

We proceed to propose \emph{{\thething} privacy}, a configurable variation of differential privacy for time series (Definition~\ref{def:thething-prv}).

%\begin{corollary}
% \label{cor:thething-nb}
% Two {\thething} neighboring series of events are event neighboring as well.
%\end{corollary}
%\kat{what is event neighboring?}

%\kat{Up to now M was a mechanism, now it is a set of mechanisms?}
\begin{definition}
  % [{\Thething} privacy]
  \label{def:thething-prv}
  Let $\mathcal{M}$ be a privacy mechanism with range $\mathcal{O}$ and domain $\mathcal{S}_T$ being the set of all time series with length $|T|$, where $T$ is a sequence of timestamps.
  $\mathcal{M}$ satisfies {\thething} $\varepsilon$-differential privacy (or, simply, {\thething} privacy) if for all sets $O \subseteq \mathcal{O}$, and for every pair of {\thething}-neighboring time series $S_T$, $S_T'$,
  % and all $T = (t)_{t \in \mathbb{N}}$,
  it holds that
  $$\Pr[\mathcal{M}(S_T) \in O] \leq e^\varepsilon \Pr[\mathcal{M}(S_T') \in O]$$
\end{definition}
% \kat{to rephrase for an easier transition -- mention here user and event level that satisfy {\thething} privacy and add discussion that we can do better and propose the new mechanism}
As discussed in Section~\ref{sec:lmdk-motiv}, user-level privacy can achieve {\thething} privacy, but it over-perturbs the final data by not distinguishing between {\thething} and regular events.
Theorem~\ref{theor:thething-prv} shows how to achieve the desired privacy for the {\thethings} (i.e.,~a total budget lower than $\varepsilon$), and at the same time provide better data quality overall.

% the existing protection levels of differential privacy do not provide adequate control in time series publishing.
%In Figure~\ref{fig:st-cont} we exhibited how additional user preferences can impact the necessary privacy/utility tradeoff.
%We introduce the notion of {\thethings} and propose {\thething} privacy, a configurable variation of differential privacy for time series.
%By taking into account {\thethings}, i.e.,~timestamps were significant events take place, {\thething} privacy can provide a satisfying protection level while not perturbing the original time series unnecessarily.

%In mereology, the formal study on the relation between parts and the entities they form, it is generally held that the identity of an observable object depends on its \emph{spatiotemporal continuity}~\cite{wiggins1967identity, scaltsas1981identity, hazarika2001qualitative}: the property of well-behaved objects that alter their state in harmony with space and time.
%Considering events that span the entirety of the user-generated time series ensures the spatiotemporal continuity of the users.
%This way, it is possible to acquire more information regarding individuals' identities, and design privacy preserving methods that offer improved privacy and utility guarantees.
\begin{theorem}
  % [{\Thething} privacy]
  \label{theor:thething-prv}
  % A privacy mechanism that protects any timestamp all the {\thething} events in a time series, satisfies {\thething} privacy.
  Let $\mathcal{M}$ be a mechanism with input a time series $S_T$, where $T$ is the set of the involved timestamps, and $L \subseteq T$ be the set of {\thething} timestamps.
  $\mathcal{M}$ is decomposed into $\varepsilon_t$-differentially private sub-mechanisms $\mathcal{M}_t$, for every $t \in T$, that apply independent randomness to the data item at $t$.
  Then, given a privacy budget $\varepsilon$, $\mathcal{M}$ satisfies {\thething} privacy if for every $t \in T$ it holds that
  $$ \sum_{i\in L \cup \{t\}} \varepsilon_i \leq \varepsilon$$
\end{theorem}
% \mk{To discuss.}

Due to space constraints, we omit the proof of Theorem~\ref{theor:thething-prv} and defer it to a longer version of this paper.
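\begin{example}
  As a sanity check of the condition in Theorem~\ref{theor:thething-prv}, consider the time series of Figure~\ref{fig:lmdk-scenario}, taking the {\thething} timestamps to be $L = \{1, 3, 5, 8\}$ out of $|T| = 8$ (our reading of the figure), and let every sub-mechanism receive $\varepsilon_i = \frac{\varepsilon}{5}$.
  Then,
  $$\sum_{i \in L \cup \{t\}} \varepsilon_i =
  \begin{cases}
    4 \cdot \frac{\varepsilon}{5} = \frac{4\varepsilon}{5} \leq \varepsilon, & t \in L \\
    5 \cdot \frac{\varepsilon}{5} = \varepsilon \leq \varepsilon, & t \notin L
  \end{cases}$$
  so the condition holds at every timestamp, and it is tight at the regular ones.
  The user-level allocation $\varepsilon_i = \frac{\varepsilon}{8}$ also satisfies the condition, with sums $\frac{\varepsilon}{2}$ and $\frac{5\varepsilon}{8}$, respectively, but it leaves part of the budget unused at every release.
\end{example}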
\subsubsection{{\Thething} privacy mechanisms}
\label{subsec:lmdk-mechs}
% \kat{add the two models -- uniform and dynamic and skip}

%\kat{isn't the uniform distribution a method? there is a section for the methods. }
Figure~\ref{fig:st-cont} shows the simplest model that implements Theorem~\ref{theor:thething-prv}, the \textbf{Uniform} distribution of privacy budget $\varepsilon$ for {\thething} privacy.
% \mk{We capitalize the first letter because it's the name of the method.}
% in comparison with user-level protection.
In this case, it is enough to distribute at each timestamp the total privacy budget divided by the number of timestamps corresponding to {\thethings}, incremented by one if we are releasing a regular timestamp.
Consequently, at each timestamp we protect every {\thething}, while reserving a part of $\varepsilon$ for the current timestamp.
%In this case, distributing $\frac{\varepsilon}{5}$ can guarantee {\thething} privacy.
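One way to write this rule down, assuming that the same share is used at every timestamp (the closed form below is our own reading of the Uniform model, chosen so that the condition of Theorem~\ref{theor:thething-prv} holds for every $t$), is
$$\varepsilon_i =
\begin{cases}
  \frac{\varepsilon}{|L| + 1}, & L \neq T \\
  \frac{\varepsilon}{|T|}, & L = T
\end{cases}
\qquad \text{for every } i \in T .$$
With $L = \emptyset$ each release receives the whole $\varepsilon$, and with $L = T$ each release receives $\frac{\varepsilon}{|T|}$.
For a Laplace-based sub-mechanism, the resulting noise scale is $\frac{(|L| + 1) \Delta f}{\varepsilon}$ instead of the $\frac{|T| \Delta f}{\varepsilon}$ of a uniform split over the whole series, i.e.,~smaller by a factor of $\frac{|L| + 1}{|T|}$ whenever the series contains more than one regular timestamp.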
% \begin{figure}[htp]
% \centering
% \includegraphics[width=0.9\linewidth]{thething-prv}
% \caption{Uniform application scenario of {\thething} privacy.}
% \label{fig:thething-prv}
% \end{figure}

Next, we propose an \textbf{Adaptive} privacy mechanism taking into account changes in the input data and exploiting the post-processing property of differential privacy.
Initially, it reserves uniformly the available privacy budget for each future release.
At each timestamp, based on a sampling rate, the mechanism either publishes the original data with noise or releases an approximation based on previous releases.
When it publishes the original data with noise, it also calculates the difference between the current and the previous release and compares the difference with the scale of the perturbation ($\frac{\Delta f}{\varepsilon}$).
The outcome of this comparison determines the adaptation of the sampling rate for the next events: if the scale is greater, the input has not changed much, and therefore the mechanism decreases the sampling rate.
When the mechanism approximates a {\thething} (but not a regular timestamp), it distributes the reserved privacy budget
% divided by the number of remaining {\thething} plus one
to the next timestamps.
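In symbols, and assuming a real-valued release perturbed with Laplace noise (the names $\delta_t$, $\lambda_t$, and $t'$ are ours and serve only as an illustration), let $\pmb{o}_{t'}$ be the previous release and $\varepsilon_t$ the budget spent at $t$; the comparison then reads
$$\delta_t = |\pmb{o}_t - \pmb{o}_{t'}|, \qquad \lambda_t = \frac{\Delta f}{\varepsilon_t}, \qquad \lambda_t > \delta_t \;\Rightarrow\; \text{decrease the sampling rate}.$$
A plausible rationale is that $\lambda_t$ equals the expected magnitude of the Laplace noise, so a change smaller than $\lambda_t$ is hard to tell apart from the perturbation itself.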
% Why skipping publications is problematic?

One might argue that we could \textbf{Skip} the \thething\ data releases.
% and limit the number of {\thethings}.
This would result in preserving all of the available privacy budget for regular events (because the set $L \cup \{t\}$ becomes $\{t\}$), which is equivalent to event-level protection.
In practice, however, this approach can eventually pose arbitrary privacy risks, especially when dealing with geotagged data.
Particularly, sporadic location data publishing~\cite{gambs2010show, russell2018fitness} or misapplying location cloaking~\cite{xssfopes2020tweet} could result in areas with sparse data points, indicating privacy-sensitive locations.
% \mk{WIP}
% \kat{write in text and remove the algorithm}
% \begin{algorithm}
% \caption{Adaptive {\thething} privacy mechanism}
% \label{algo:adapt-lmdk-priv}

% \SetKwInput{KwData}{Input}
% \SetKwInput{KwResult}{Output}

% \SetKwData{diffCur}{diffCur}
% \SetKwData{diffMin}{diffMin}
% \SetKwData{evalCur}{evalCur}
% \SetKwData{evalOrig}{evalOrig}
% \SetKwData{evalSum}{evalSum}
% \SetKwData{metricCur}{metricCur}
% \SetKwData{metricOrig}{metricOrig}
% \SetKwData{opt}{opt}
% \SetKwData{opti}{opt$_i$}
% \SetKwData{optim}{optim}
% \SetKwData{optimi}{optim$_i$}
% \SetKwData{opts}{opts}
% \SetKwData{reg}{reg}

% \SetKwData{S}{$S_T$}
% \SetKwData{L}{$L$}
% \SetKwData{epsilon}{$\varepsilon$}

% \SetKwFunction{calcMetric}{calcMetric}
% \SetKwFunction{evalSeq}{evalSeq}
% \SetKwFunction{getCombs}{getCombs}
% \SetKwFunction{getOpts}{getOpts}

% \DontPrintSemicolon

% \KwData{\S, \L, \epsilon}
% \KwResult{\optim}
% \BlankLine

% % \If{abs($$)}

% % \If{$i \in L$}{
% % \lmdks $\leftarrow$ \lmdks + 1
% % \ForEach{$j \in [i + 1, T]$}{
% % $varepsilon_j \leftarrow varepsilon_j + \frac{\varepsilon_i}{|T| - \lmdks + 1}$
% % }
% % }

% % Evaluate the original
% \metricOrig $\leftarrow$ \calcMetric{$\{t_n\}, \emptyset, \{l_k\}$}\;
% \evalOrig $\leftarrow$ \evalSeq{\metricOrig}\;

% % Get all possible option combinations
% \opts $\leftarrow$ \getOpts{$\{t_n\}, \{l_k\}$}\;

% % Track the minimum (best) evaluation
% \diffMin $\leftarrow$ $\infty$\;

% % Track the optimal sequence (the one with the best evaluation)
% \optim $\leftarrow$ $[]$\;

% \ForEach{\opt $\in$ \opts}{\label{algo:lmdk-sel-opt-for-each}
% \evalSum $\leftarrow 0$\;
% \ForEach{\opti $\in$ \opt}{
% \metricCur $\leftarrow$ \calcMetric{$\{t_n\}, \opti, \{l_k\}$}\;\label{algo:lmdk-sel-opt-comparison}
% \evalSum $\leftarrow$ \evalSum $+$ \evalSeq{\metricCur}\;

% % Compare with current optimal
% \diffCur $\leftarrow \left|\evalSum/\#\opt - \evalOrig\right|$\;
% \If{\diffCur $<$ \diffMin}{
% \diffMin $\leftarrow$ \diffCur\;
% \optim $\leftarrow$ \opt\;
% }
% }
% }\label{algo:lmdk-sel-opt-end}
% \Return{\optim}
% \end{algorithm}
\subsubsection{{\Thething} privacy under temporal correlation}
\label{subsec:correlations}
From the discussion so far, it is evident that for the budget distribution it is not the positions but rather the number of the {\thethings} that matters.
However, this is not the case under the presence of temporal correlation, which is inherent in continuously generated data.

% HMMs have two important independence properties:
% Markov hidden process: future depends on past via the present.
% Current observation independent of all else given current state.
% Intuitively, D^t or D^{t+1} "cuts off" the propagation of the Markov chain.
The Hidden Markov Model~\cite{baum1966statistical} stipulates two important independence properties: (i)~the future (past) depends on the past (future) via the present, and (ii)~the current observation is independent of the rest given the current state.
%Thus, the observation of a data release at a timestamp $t$ depends only on the respective input data set $D_t$, i.e.,~the current state.
Hence, there is independence between an observation at a specific timestamp and previous/next data sets under the presence of the current input data set.
Intuitively, knowing the data set at timestamp $t$ stops the propagation of the Markov chain towards the next or previous timestamps.
%\kat{do we see this in the formula 1 ?}
%when calculating the forward or backward privacy loss respectively.

Cao et al.~\cite{cao2017quantifying} propose a method for computing the total temporal privacy loss $\alpha_t$ at a timestamp $t$, which accounts for the extra privacy loss due to previous and next releases $\pmb{o}$ of $\mathcal{M}$ under temporal correlation.
It is the sum of the backward and forward privacy loss, $\alpha^B_t$ and $\alpha^F_t$, minus the privacy budget $\varepsilon_t$, which both of them include.
By Theorem~\ref{theor:thething-prv}, at every timestamp $t$ we consider the data at $t$ and at the {\thething} timestamps $L$.
%According to the Definitions~{\ref{def:bpl} and \ref{def:fpl}}, we calculate the backward and forward privacy loss by taking into account the privacy budget at previous and next data releases respectively.
When sequentially composing the data releases for each timestamp $i$ in $L \cup \{t\}$ we
%calculate the temporal privacy loss $\alpha_t$ at each timestamp $t \in L \cup \{i\}$ by
%consider the previous and next data releases at the timestamps $i^{-}, i^{+} \in L \cup \{t\} \setminus \{i\}$ respectively.
consider the previous releases in the whole time series back to the timestamp $i^{-}$ that immediately precedes $i$ in the ordered $L \cup \{t\}$, and the next data releases in the whole time series up to the timestamp $i^{+}$ that immediately follows $i$ in the ordered $L \cup \{t\}$.
%\kat{not sure I understand}
%Thus, we calculate the backward/forward privacy loss by taking into account the data releases after/before the previous/next data item.
That is:
% \dk{do we keep looking at all Landmarks both for backward and forward? I would assume that for backward we are looking to the Landmarks until the i and for the forward to the Landmarks after the i - if we would like to be consistent with Cao. Otherwise the writing here is confusing.}
% \mk{We are discussing about the case where we calculate the tpl at each timestamp i in L+{t}. Therefore, bpl at i is calculated until i- and fpl at i until i+.}
\begin{align}
|
||
\adjustbox{max width=0.9\linewidth}{
|
||
$\alpha_i =
|
||
\underbrace{\ln \frac{\Pr[\pmb{o}_{i^{-} + 1}, \dots, \pmb{o}_i | D_i]}{\Pr[\pmb{o}_{i^{-} + 1}, \dots, \pmb{o}_i | D'_i]}}_{\alpha^B_i} +
|
||
\underbrace{\ln \frac{\Pr[\pmb{o}_{i^{+} - 1}, \dots, \pmb{o}_i | D_i]}{\Pr[\pmb{o}_{i^{+} - 1}, \dots, \pmb{o}_i | D'_i]}}_{\alpha^F_i} -
|
||
\underbrace{\ln \frac{\Pr[\pmb{o}_i | D_i]}{\Pr[\pmb{o}_i | D'_i]}}_{\varepsilon_i}$
|
||
}
|
||
\end{align}

Finally, $\alpha_t$ is equal to the sum of all $\alpha_i$, $i \in L \cup \{t\}$.
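
To illustrate the index bookkeeping, consider a hypothetical instance (chosen only for illustration): a time series with timestamps $1, \dots, 12$, {\thethings} $L = \{3, 7\}$, and $t = 10$, so that the ordered $L \cup \{t\}$ is $(3, 7, 10)$.
We assume here the boundary conventions $i^{-} = 0$ for the first element and $i^{+} = 13$ for the last one, and we write $\alpha^B_i[a, b]$ and $\alpha^F_i[a, b]$ as shorthand for the backward and forward privacy loss computed over the releases $\pmb{o}_a, \dots, \pmb{o}_b$:
\begin{align*}
  \alpha_3 &= \alpha^B_3[1, 3] + \alpha^F_3[3, 6] - \varepsilon_3 \\
  \alpha_7 &= \alpha^B_7[4, 7] + \alpha^F_7[7, 9] - \varepsilon_7 \\
  \alpha_{10} &= \alpha^B_{10}[8, 10] + \alpha^F_{10}[10, 12] - \varepsilon_{10}
\end{align*}
Under these assumptions, the total loss at $t = 10$ is the sum of the three right-hand sides; the backward windows partition the releases $\pmb{o}_1, \dots, \pmb{o}_{10}$ and the forward windows partition $\pmb{o}_3, \dots, \pmb{o}_{12}$, so each release falls into exactly one window of each kind.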

%
% where $x_t$ (or $x'_t$) is the potential (neighboring) data item of an individual who is targeted by an adversary with knowledge $\mathbb{D}_t$.
%where $D_t$ and $D'_t$ are the neighboring input data sets (Definition~\ref{def:nb-d-s}) responsible for the output $\pmb{o}_t$.
%Notice that if $t$ is the first or last item in $L \cup \{i\}$ then we need to set $t_{\text{prv}} = 0$ or $t_{\text{nxt}} = \max(T) + 1$.

%In Section~\ref{sec:eval}, we experimentally show how the distribution of {\thethings} impacts the overall privacy loss of the user.


\section{Evaluation}
\label{sec:the-thing-eval}

In this section, we present the experiments that we performed on real and synthetic data sets.
With the experiments on the synthetic data sets, we show the privacy loss incurred by our framework when tuning the size and statistical characteristics of the input {\thething} set $L$.
We also show how the privacy loss under temporal correlation is affected by the number and distribution of the {\thethings}.
With the experiments on the real data sets, we show the performance, in terms of utility, of our three {\thething} mechanisms.

Notice that in our experiments, when $0\%$ and $100\%$ of the events are {\thethings}, we get the same behavior as in event- and user-level privacy, respectively.
This happens because, when there are no {\thethings}, at each timestamp we take into account only the data items at the current timestamp and ignore the rest of the time series (event-level).
In contrast, when every timestamp corresponds to a {\thething}, we consider and protect all the events throughout the entire series (user-level).


\subsection{Setting, configurations, and data sets}
\paragraph{Setting}
We implemented our experiments\footnote{Code available at \url{https://gitlab.com/adhesivegoldfinch/cikm}} in Python $3$.$9$.$5$ and executed them on a machine with Intel i$7$-$6700$HQ $3$.$5$GHz CPU and $16$GB RAM, running Manjaro $21$.$0$.$5$.
We repeated each experiment $100$ times and we report the mean over these iterations.


\paragraph{Data sets}
For the \emph{real} data sets, we used Geolife~\cite{zheng2010geolife} and T-drive~\cite{yuan2010t}, from which we sampled the first $1000$ data items.
We achieved the desired {\thething} percentages by utilizing the method of Li et al.~\cite{li2008mining} for detecting stay points in trajectory data.
In more detail, for each data item the algorithm checks whether each subsequent item lies within a given distance threshold $\Delta l$, and measures the time period $\Delta t$ between the present point and the last such subsequent point.
We achieve $0$, $20$, $40$, $60$, $80$, and $100$ {\thething} percentages by setting the ($\Delta l$ in meters, $\Delta t$ in minutes) pairs input to the stay point discovery method for T-drive as [($0$, $1000$), ($2095$, $30$), ($2790$, $30$), ($3590$, $30$), ($4825$, $30$), ($10350$, $30$)] and for Geolife as [($0$, $100000$), ($205$, $30$), ($450$, $30$), ($725$, $30$), ($855$, $30$), ($50000$, $30$)].
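
The stay point condition can be stated compactly as follows; this is a sketch of our reading of the method of Li et al.~\cite{li2008mining}, in which the input pair ($\Delta l$, $\Delta t$) acts as a pair of thresholds, and the symbols $p_k$, $\tau_k$, and $d(\cdot, \cdot)$ are introduced here only for illustration.
A run of consecutive points $p_m, \dots, p_n$ with timestamps $\tau_m, \dots, \tau_n$ forms a stay point when
\begin{equation*}
  d(p_m, p_k) \leq \Delta l \quad \forall k \in \{m + 1, \dots, n\}
  \quad \text{and} \quad
  \tau_n - \tau_m \geq \Delta t ,
\end{equation*}
where $d$ denotes the distance between two locations.
Under this reading, increasing $\Delta l$ for a fixed $\Delta t$ of $30$ minutes lets more points qualify, which is consistent with the increasing distance thresholds listed above for higher {\thething} percentages.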


Next, we generated synthetic time series of length equal to $100$ timestamps, for which we varied the number and distribution of {\thethings}.
% to achieve the necessary {\thethings} distribution and percentage for where applicable.
% \paragraph{{\Thethings} distribution}
We created \emph{left-skewed} (the {\thethings} are distributed towards the end), \emph{symmetric} (in the middle), \emph{right-skewed} (in the beginning), \emph{bimodal} (both end and beginning), and \emph{uniform} (all over the time series) {\thething} distributions.
%, in the beginning and in the end (\emph{bimodal}), and all over the extend (\emph{uniform}) of a time series.
When pertinent, we group the left- and right-skewed cases as simply `skewed', since they share several features due to symmetry.
In order to get {\thethings} with the above distribution features, we generate probability distributions with appropriate characteristics and sample from them, without replacement, the desired number of points.
%The generated distributions are representative of the cases that we wish to examine during the experiments.
% For example, for a left-skewed {\thethings} distribution we would utilize a truncated distribution resulting from the restriction of the domain of a normal distribution to the beginning and end of the time series with its location shifted to the center of the right half of the series.
For consistency, we tie the scale parameter of these distributions to the length of the series, setting it equal to the series' length divided by a constant.
%We take into account only the temporal order of the points and the position of regular and {\thething} events within the series.
Note that, for the experiments performed on the synthetic data sets, the original values to be released do not influence our conclusions; thus, we ignore them.


\paragraph{Configurations}
We model the temporal correlation in the synthetic data as a \emph{stochastic matrix} $P$, using a \emph{Markov Chain}~\cite{gagniuc2017markov}.
$P$ is an $n \times n$ matrix, where the element $p_{ij}$
%at the $i$th row of the $j$th column that
represents the transition probability from a state $i$ to another state $j$.
%, $\forall i, j \leq n$.
It holds that the elements of every row $i$ of $P$ sum up to $1$.
We follow the \emph{Laplacian smoothing} technique~\cite{sorkine2004laplacian} as utilized in~\cite{cao2018quantifying} to generate the matrix $P$ with a degree of temporal correlation $s>0$.
% and generate a stochastic matrix $P$ with a degree of temporal correlation $s$ by calculating each element $P_{ij}$ as follows
%$$\frac{(I_{n})_{ij} + s}{\sum_{k = 1}^{n}((I_{n})_{jk} + s)}$$
%where $I_{n}$ is an \emph{identity matrix} of size $n$.
%, i.e.,~an $n \times n$ matrix with $1$s on its main diagonal and $0$s elsewhere.
% $s$ takes only positive values which are comparable only for stochastic matrices of the same size.
The parameter $s$ dictates the strength of the correlation; the lower its value,
%the lower the degree of uniformity of each row, and therefore
the stronger the correlation degree.
%In general, larger transition matrices tend to be uniform, resulting in weaker correlation.
In our experiments, for simplicity, we set $n = 2$ and we investigate the effect of \emph{weak} ($s = 1$), \emph{moderate} ($s = 0.1$), and \emph{strong} ($s = 0.01$) temporal correlation degree on the overall privacy loss.
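
For concreteness, the following sketch instantiates $P$ for $n = 2$, assuming that the smoothing adds $s$ to every entry of the identity matrix $I_n$ and then normalizes each row (our reading of~\cite{cao2018quantifying}; the exact construction in the implementation may differ):
\begin{equation*}
  p_{ij} = \frac{(I_n)_{ij} + s}{\sum_{k = 1}^{n} \left( (I_n)_{ik} + s \right)}
  \quad \Rightarrow \quad
  P = \frac{1}{1 + 2s}
  \begin{pmatrix}
    1 + s & s \\
    s & 1 + s
  \end{pmatrix} .
\end{equation*}
Under this assumption, the probability of remaining in the same state is $(1 + s) / (1 + 2s)$, i.e.,~approximately $0.667$ for $s = 1$, $0.917$ for $s = 0.1$, and $0.990$ for $s = 0.01$, which agrees with the intuition that lower values of $s$ yield stronger correlation.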


We set the privacy budget $\varepsilon = 1$.
To perturb the spatial values of the real data sets, we inject noise drawn according to the Planar Laplace mechanism~\cite{andres2013geo}.
Finally, notice that all diagrams are in logarithmic scale.
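
For reference, the planar Laplace distribution of~\cite{andres2013geo} centered at the actual location $x_0$ has the density below; it is given as a reminder rather than as a description of our implementation, $d(\cdot, \cdot)$ denotes the Euclidean distance, and $\varepsilon'$ is a placeholder introduced here for the budget spent on a single release:
\begin{equation*}
  D_{\varepsilon'}(x_0)(x) = \frac{(\varepsilon')^2}{2\pi} \, e^{-\varepsilon' d(x_0, x)} .
\end{equation*}
In polar coordinates around $x_0$, the radius follows a Gamma distribution with shape $2$ and scale $1 / \varepsilon'$, so the expected displacement is $2 / \varepsilon'$ in the distance unit of $d$; the smaller the budget allotted to a release, the larger the expected error.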


\subsection{Experiments}

\paragraph{Budget allocation schemes}

Figure~\ref{fig:real} exhibits the performance of the three mechanisms: Skip, Uniform, and Adaptive.

\begin{figure}[htp]
  \centering
  \subcaptionbox{Geolife\label{fig:geolife}}{%
    \includegraphics[width=.5\linewidth]{geolife}%
  }%
  \subcaptionbox{T-drive\label{fig:t-drive}}{%
    \includegraphics[width=.5\linewidth]{t-drive}%
  }%
  \caption{The mean absolute error (in meters) of the released data for different {\thethings} percentages.}
  \label{fig:real}
\end{figure}

For the Geolife data set (Figure~\ref{fig:geolife}), Skip has the best performance (measured in Mean Absolute Error, in meters) because, by approximating the {\thething} data based on previous releases, it can invest the most budget overall at every regular event.
Due to the data set's high density (one point every $1$--$5$ seconds or every $5$--$10$ meters), constantly approximating has a low impact on the data utility.
On the contrary, the lower density of the T-drive data set (Figure~\ref{fig:t-drive}) has a negative impact on the performance of Skip.
In the T-drive data set, the Adaptive mechanism outperforms the Uniform one by $10$\%--$20$\% for all {\thething} percentages greater than $0$, and the Skip one by more than $20$\%.
In general, we can claim that Adaptive is the best-performing mechanism, if we take into consideration the drawbacks of the Skip mechanism mentioned in Section~\ref{subsec:lmdk-mechs}.
Moreover, designing a data-dependent sampling scheme could further improve the results for Adaptive.


\paragraph{Temporal distance and correlation}
Figure~\ref{fig:avg-dist} compares the average temporal distance of the events from the previous/next {\thething} or the start/end of the time series for various distributions in synthetic data.
More specifically, we count, for every event, the total number of events between itself and the nearest {\thething} or the series edge.
We observe that the uniform and bimodal distributions tend to limit the regular event--{\thething} distance.
This is because the former scatters the {\thethings}, while the latter distributes them on both edges, leaving a shorter space uninterrupted by {\thethings}.
% and as a result they reduce the uninterrupted space by landmarks in the sequence.
On the contrary, concentrating the {\thethings} in one part of the sequence, as in the skewed or symmetric cases, creates a wider space without {\thethings}.

\begin{figure}[htp]
  \centering
  \includegraphics[width=.5\linewidth]{avg-dist}%
  \caption{Average temporal distance of the events from the {\thethings} for different {\thethings} percentages within a time series in various {\thethings} distributions.}
  \label{fig:avg-dist}
\end{figure}

Figure~\ref{fig:dist-cor} compares the aforementioned distributions regarding the overall privacy loss under moderate (Figure~\ref{fig:dist-cor-mod}) and strong (Figure~\ref{fig:dist-cor-stg}) correlation degrees.
The line shows the overall privacy loss without temporal correlation, which is the same for all cases of {\thething} distribution.
We do not discuss the results under a weak correlation degree (Figure~\ref{fig:dist-cor-wk}) in detail, since the distributions converge in this case.
In combination with Figure~\ref{fig:avg-dist}, we conclude that a greater average event--{\thething} distance in a distribution can result in greater overall privacy loss under moderate and strong temporal correlation.
This is because the backward/forward privacy loss accumulates more over time in wider spaces without {\thethings} (see Section~\ref{subsec:correlations}).
Furthermore, the behavior of the privacy loss is as expected regarding the temporal correlation degree.
Predictably, a stronger correlation degree generates higher privacy loss while widening the gap between the different distribution cases.
On the contrary, a weaker correlation degree makes it harder to differentiate among the {\thething} distributions.

\begin{figure}[htp]
  \centering
  \subcaptionbox{Weak correlation\label{fig:dist-cor-wk}}{%
    \includegraphics[width=.5\linewidth]{dist-cor-wk}%
  }%
  \hspace{\fill}
  \subcaptionbox{Moderate correlation\label{fig:dist-cor-mod}}{%
    \includegraphics[width=.5\linewidth]{dist-cor-mod}%
  }%
  \subcaptionbox{Strong correlation\label{fig:dist-cor-stg}}{%
    \includegraphics[width=.5\linewidth]{dist-cor-stg}%
  }%
  \caption{Privacy loss for different {\thethings} percentages and distributions, under weak, moderate, and strong degrees of temporal correlation.
  The line shows the overall privacy loss without temporal correlation.}
  \label{fig:dist-cor}
\end{figure}


\section{Summary and future work}
\label{sec:lmdk-sum}
In this chapter, we presented \emph{{\thething} privacy} for privacy-preserving time series publishing, which allows for the protection of significant events while improving the utility of the final result w.r.t.\ traditional user-level differential privacy.
We also proposed three models for {\thething} privacy, and quantified the privacy loss under temporal correlation.
Our experiments on real and synthetic data sets validate our proposal.
In the future, we aim to investigate privacy-preserving {\thething} selection and propose a mechanism based on user preferences and semantics.
|