the-last-thing/text/the-thing/motivation.tex

64 lines
6.5 KiB
TeX
Raw Normal View History

2021-07-18 17:31:05 +02:00
\section{Motivation}
\label{sec:lmdk-motiv}
The plethora of sensors currently embedded in
or paired with personal devices and other infrastructures have paved the way for the development of numerous \emph{crowdsensing services} (e.g.,~Google Maps~\cite{gmaps}, Waze~\cite{waze}, etc.) based on the collected personal, and usually geotagged and timestamped data.
User--service interactions gather personal event-like data, e.g.,~(\emph{`Bob', `dining', `Canal Saint-Martin', $17{:}00$}).
When the interactions are performed in a continuous manner, we obtain ~\emph{time series} of events.
An \emph{event} represents a user--service interaction, registering the information of the individual at a specific time point, i.e.,~a data item that is a pair of an identifying attribute of an individual and the---possibly sensitive---information at a timestamp (including contextual information).
It can be seen as a correspondence to a record in a database, where each individual may participate once, e.g.,~(`Bob', `dining', `Canal Saint-Martin', $5$).
Typically, users interact with the services more than once, generating data in a continuous manner (\emph{time series}).
The services collect and further process the time series in order to give useful feedback to the involved users or to provide valuable insight to various internal/external analytical services.
Depending on its span, we distinguish the processing into \emph{finite}, when taking place during a predefined time interval, and \emph{infinite}, when taking place in an uninterrupted fashion.
% Figure~\ref{fig:scenario} shows an example of a finite time series produced by a user (Bob) and composed by $8$ timestamps during his trajectory from his home (\'Elys\'ee) to his work (Louvre) to his hangout (Saint-Martin) and back to his home.
\begin{example}
\label{ex:lmdk-scenario}
Consider a finite sequence of spatiotemporal data generated by Bob during an interval of $\ 8$ timestamps, as shown in Figure~\ref{fig:lmdk-scenario}.
These data are the result of user--LBS interaction while retrieving location-based information or reporting user-state at various locations.
Events in a shade correspond to privacy-sensitive events that Bob has defined beforehand. For instance his home is around {\'E}lys{\'e}e, his workplace is around the Louvre, and his hangout is around Canal Saint-Martin.
\begin{figure}[htp]
\centering
\includegraphics[width=\linewidth]{lmdk-scenario}
\caption{A time series with {\thethings} (highlighted in gray).}
\label{fig:lmdk-scenario}
\end{figure}
\end{example}
The regulation regarding the processing of user-generated data sets~\cite{tankard2016gdpr} requires the provision of privacy guarantees to the users.
At the same time, it is essential to provide utility metrics to the final consumers of the privacy-preserving process output.
A widely recognized tool that introduces probabilistic randomness to the original data, while quantifying with a parameter $\varepsilon$ (`privacy budget'~\cite{mcsherry2009privacy}) the privacy/utility ratio is \emph{$\varepsilon$-differential privacy}~\cite{dwork2006calibrating}.
\emph{Event}, \emph{user}~\cite{dwork2010differential, dwork2010pan}, and \emph{$w$-event}~\cite{kellaris2014differentially} comprise the possible levels of privacy protection.
Event-level limits the privacy protection to \emph{any single event}, user-level protects \emph{all the events} of any user, and $w$-event provides privacy protection to \emph{any sequence of $w$ events}.
The privacy mechanisms for the aforementioned levels assume that in a time series any single event, or any sequence of events, or the entire series of events is equally privacy-significant for the users.
In reality, this is an simplistic assumption.
The significance of an event is related to certain user-defined privacy criteria, or to its adjacent events, as well as to the entire time series.
Identifying \emph{\thething} (significant) events can be done in an automatic or manual way (but is out of scope for this work).
For example, in spatiotemporal data, \emph{places where an individual spent some time} denote \emph{points of interest} (POIs) (called also stay points)~\cite{zheng2015trajectory}.
Such data items, and more particularly their spatial attribute values, can be less privacy-sensitive~\cite{primault2018long}, e.g.,~parks, theaters, etc. or, if individuals frequent them, they can reveal supplementary information, e.g.,~residences (home addresses)~\cite{gambs2010show}, places of worship (religious beliefs)~\cite{franceschi-bicchierairussell2015redditor}, etc.
POIs can be an example of how we can choose {\thethings}, but the idea is not limited to these.
\begin{figure}[htp]
\centering
\includegraphics[width=\linewidth]{st-cont}
\caption{User-level and {\thething} $\varepsilon$-differential privacy protection for the time series of Figure~\ref{fig:lmdk-scenario}.}
\label{fig:st-cont}
\end{figure}
We argue that protecting only {\thething} events along with any regular event release is sufficient for the user's protection, while it improves data utility.
Take for example the scenario in Figure~\ref{fig:st-cont}, where {\thethings} are highlighted in gray.
If we want to protect the {\thething} points, we have to allocate at most a budget of $\varepsilon$ to the {\thethings}.
Notice that the more budget we allocate to an event the less we protect it, but at the same time we maintain its utility.
In this scenario, event-level protection is not suitable since it can only protect one event at a time.
Hence, we have to apply user-level privacy protection by distributing equal portions of $\varepsilon$ to all the events, i.e.,~$\frac{\varepsilon}{8}$ to each one (the equivalent of applying $8$-event privacy).
In this way, we have protected the {\thething} points; we have allocated a total of $\frac{\varepsilon}{2}<\varepsilon$ to the {\thethings}.
However, perturbing by $\frac{\varepsilon}{8}$ each regular point deteriorates the data utility unnecessarily.
With {\thething} privacy we propose to distribute the budget taking into account only the existence of the {\thethings} when we release an event of the time series, i.e.,~allocating $\frac{\varepsilon}{5}$ ($4\ \text{\thethings} + 1\ \text{regular point}$) to each event (see Figure~\ref{fig:st-cont}).
This way, we still guarantee that the {\thethings} are adequately protected, as they receive a total budget of $\frac{4\varepsilon}{5}<\varepsilon$.
At the same time, we avoid over-perturbing the regular events, as we allocate to them a higher total budget ($\frac{4\varepsilon}{5}$) than in user-level ($\frac{\varepsilon}{2}$), and thus less noise.