diff --git a/graphics/the-thing.tex b/graphics/the-thing.tex deleted file mode 100644 index 4cc6f02..0000000 --- a/graphics/the-thing.tex +++ /dev/null @@ -1,210 +0,0 @@ -\chapter{Significant events} -\label{ch:the-thing} - -In this chapter, we propose a novel configurable privacy scheme, \emph{\thething} privacy, which takes into account significant events (\emph{\thethings}) in the time series and allocates the available privacy budget accordingly. -We propose two privacy models that guarantee {\thething} privacy and validate our proposal on real and synthetic data sets. - - -\section{Motivation} -\label{sec:the-thing-motiv} - -The plethora of sensors currently embedded in -or paired with personal devices and other infrastructures has paved the way for the development of numerous \emph{crowdsensing services} (e.g.,~Google Maps~\cite{gmaps} and Waze~\cite{waze}) based on the collected personal, and usually geotagged and timestamped, data. - -User--service interactions gather personal event-like data, e.g.,~(\emph{`Bob', `dining', `Canal Saint-Martin', $17{:}00$}). -When the interactions are performed in a continuous manner, we obtain \emph{time series} of events. -An \emph{event} represents a user--service interaction, registering the information of the individual at a specific time point, i.e.,~a data item that is a pair of an identifying attribute of an individual and the---possibly sensitive---information at a timestamp (including contextual information). -It corresponds to a record in a database, where each individual may participate once, e.g.,~(`Bob', `dining', `Canal Saint-Martin', $5$). -Typically, users interact with the services more than once, generating data in a continuous manner (\emph{time series}). -The services collect and further process the time series in order to give useful feedback to the involved users or to provide valuable insight to various internal/external analytical services. 
-Depending on its span, we distinguish the processing into \emph{finite}, when taking place during a predefined time interval, and \emph{infinite}, when taking place in an uninterrupted fashion. -% Figure~\ref{fig:scenario} shows an example of a finite time series produced by a user (Bob) and composed by $8$ timestamps during his trajectory from his home (\'Elys\'ee) to his work (Louvre) to his hangout (Saint-Martin) and back to his home. - -\begin{example} - \label{ex:lmdk-scenario} - - Consider a finite sequence of spatiotemporal data generated by Bob during an interval of $8$ timestamps, as shown in Figure~\ref{fig:lmdk-scenario}. - These data are the result of user--LBS interaction while retrieving location-based information or reporting user-state at various locations. - Shaded events correspond to privacy-sensitive events that Bob has defined beforehand. For instance, his home is around {\'E}lys{\'e}e, his workplace is around the Louvre, and his hangout is around Canal Saint-Martin. - - \begin{figure}[htp] - \centering - \includegraphics[width=\linewidth]{lmdk-scenario} - \caption{A time series with {\thethings} (highlighted in gray).} - \label{fig:lmdk-scenario} - \end{figure} - -\end{example} - - -The regulation regarding the processing of user-generated data sets~\cite{tankard2016gdpr} requires the provision of privacy guarantees to the users. -At the same time, it is essential to provide utility metrics to the final consumers of the privacy-preserving process output. -A widely recognized tool that introduces probabilistic randomness to the original data, while quantifying the privacy/utility ratio with a parameter $\varepsilon$ (`privacy budget'~\cite{mcsherry2009privacy}), is \emph{$\varepsilon$-differential privacy}~\cite{dwork2006calibrating}. -\emph{Event}, \emph{user}~\cite{dwork2010differential, dwork2010pan}, and \emph{$w$-event}~\cite{kellaris2014differentially} comprise the possible levels of privacy protection. 
-Event-level limits the privacy protection to \emph{any single event}, user-level protects \emph{all the events} of any user, and $w$-event provides privacy protection to \emph{any sequence of $w$ events}. - -The privacy mechanisms for the aforementioned levels assume that in a time series any single event, or any sequence of events, or the entire series of events is equally privacy-significant for the users. -In reality, this is a simplistic assumption. -The significance of an event is related to certain user-defined privacy criteria, or to its adjacent events, as well as to the entire time series. -Identifying \emph{\thething} (significant) events can be done in an automatic or manual way (but is out of scope for this work). -For example, in spatiotemporal data, \emph{places where an individual spent some time} denote \emph{points of interest} (POIs) (also called stay points)~\cite{zheng2015trajectory}. -Such data items, and more particularly their spatial attribute values, can be less privacy-sensitive~\cite{primault2018long}, e.g.,~parks, theaters, etc., or, if individuals frequent them, they can reveal supplementary information, e.g.,~residences (home addresses)~\cite{gambs2010show}, places of worship (religious beliefs)~\cite{franceschi-bicchierairussell2015redditor}, etc. -POIs can be an example of how we can choose {\thethings}, but the idea is not limited to these. - -\begin{figure}[htp] - \centering - \includegraphics[width=\linewidth]{st-cont} - \caption{User-level and {\thething} $\varepsilon$-differential privacy protection for the time series of Figure~\ref{fig:lmdk-scenario}.} - \label{fig:st-cont} -\end{figure} - -We argue that protecting only {\thething} events along with any regular event release is sufficient for the user's protection, while it improves data utility. -Take, for example, the scenario in Figure~\ref{fig:st-cont}, where {\thethings} are highlighted in gray. 
-If we want to protect the {\thething} points, we have to allocate at most a budget of $\varepsilon$ to the {\thethings}. -Notice that the more budget we allocate to an event the less we protect it, but at the same time we maintain its utility. -In this scenario, event-level protection is not suitable since it can only protect one event at a time. -Hence, we have to apply user-level privacy protection by distributing equal portions of $\varepsilon$ to all the events, i.e.,~$\frac{\varepsilon}{8}$ to each one (the equivalent of applying $8$-event privacy). -In this way, we have protected the {\thething} points; we have allocated a total of $\frac{\varepsilon}{2}<\varepsilon$ to the {\thethings}. -However, perturbing by $\frac{\varepsilon}{8}$ each regular point deteriorates the data utility unnecessarily. -With {\thething} privacy we propose to distribute the budget taking into account only the existence of the {\thethings} when we release an event of the time series, i.e.,~allocating $\frac{\varepsilon}{5}$ ($4\ \text{\thethings} + 1\ \text{regular point}$) to each event (see Figure~\ref{fig:st-cont}). -This way, we still guarantee that the {\thethings} are adequately protected, as they receive a total budget of $\frac{4\varepsilon}{5}<\varepsilon$. -At the same time, we avoid over-perturbing the regular events, as we allocate to them a higher total budget ($\frac{4\varepsilon}{5}$) than in user-level ($\frac{\varepsilon}{2}$), and thus less noise. - - -\section{Contribution} -\label{sec:the-thing-contrib} - -In this chapter, we formally define a novel privacy notion that we call \emph{{\thething} privacy}. -We apply this privacy notion to time series consisting of \emph{{\thethings}} and regular events, and we design and implement three {\thething} privacy mechanisms. -We further study {\thething} privacy under temporal correlation that is inherent in time series publishing. 
-Finally, we evaluate {\thething} privacy with real and synthetic data sets, in settings with or without temporal correlation, showcasing the validity of our model. - - -\section{Evaluation} -\label{sec:the-thing-eval} - -In this section we present the experiments that we performed on real and synthetic data sets. -With the experiments on the synthetic data sets we show the privacy loss incurred by our framework when tuning the size and statistical characteristics of the input {\thething} set $L$. -We also show how the privacy loss under temporal correlation is affected by the number and distribution of the {\thethings}. -With the experiments on the real data sets, we show the performance in terms of utility of our three {\thething} mechanisms. - -Notice that in our experiments, in the cases when we have $0\%$ and $100\%$ of the events being {\thethings}, we get the same behavior as in event- and user-level privacy, respectively. -This happens because, when there are no {\thethings}, at each timestamp we take into account only the data items at the current timestamp and ignore the rest of the time series (event-level). -In contrast, when each timestamp corresponds to a {\thething}, we consider and protect all the events throughout the entire series (user-level). - - -\subsection{Setting, configurations, and data sets} -\paragraph{Setting} -We implemented our experiments\footnote{Code available at \url{https://gitlab.com/adhesivegoldfinch/cikm}} in Python $3$.$9$.$5$ and executed them on a machine with Intel i$7$-$6700$HQ $3$.$5$GHz CPU and $16$GB RAM, running Manjaro $21$.$0$.$5$. -We repeated each experiment $100$ times and we report the mean over these iterations. - - -\paragraph{Data sets} -For the \emph{real} data sets, we used Geolife~\cite{zheng2010geolife} and T-drive~\cite{yuan2010t}, from which we sampled the first $1000$ data items. 
-We achieved the desired {\thethings} percentages by utilizing the method of Li et al.~\cite{li2008mining} for detecting stay points in trajectory data. -In more detail, the algorithm checks for each data item whether each subsequent item is within a given distance threshold $\Delta l$ and measures the time period $\Delta t$ between the present point and the last subsequent point. -We achieve $0$, $20$, $40$, $60$, $80$, and $100$ {\thethings} percentages by setting the ($\Delta l$ in meters, $\Delta t$ in minutes) pairs input to the stay point discovery method for T-drive as [($0$, $1000$), ($2095$, $30$), ($2790$, $30$), ($3590$, $30$), ($4825$, $30$), ($10350$, $30$)] and for Geolife as [($0$, $100000$), ($205$, $30$), ($450$, $30$), ($725$, $30$), ($855$, $30$), ($50000$, $30$)]. - - -Next, we generated synthetic time series of length equal to $100$ timestamps, for which we varied the number and distribution of {\thethings}. -% to achieve the necessary {\thethings} distribution and percentage for where applicable. -% \paragraph{{\Thethings} distribution} -We created \emph{left-skewed} (the {\thethings} are distributed towards the end), \emph{symmetric} (in the middle), \emph{right-skewed} (in the beginning), \emph{bimodal} (both end and beginning), and \emph{uniform} (all over the time series) {\thething} distributions. -%, in the beginning and in the end (\emph{bimodal}), and all over the extend (\emph{uniform}) of a time series. -When pertinent, we group the left- and right-skewed cases as simply `skewed', since they share several features due to symmetry. -In order to get {\thethings} with the above distribution features, we generate probability distributions with appropriate characteristics and sample from them, without replacement, the desired number of points. -%The generated distributions are representative of the cases that we wish to examine during the experiments. 
-% For example, for a left-skewed {\thethings} distribution we would utilize a truncated distribution resulting from the restriction of the domain of a normal distribution to the beginning and end of the time series with its location shifted to the center of the right half of the series. -For consistency, we calculate the scale parameter depending on the length of the series by setting it equal to the series' length over a constant. -%We take into account only the temporal order of the points and the position of regular and {\thething} events within the series. -Note that for the experiments performed on the synthetic data sets, the original values to be released do not influence our conclusions; thus, we ignore them. - - -\paragraph{Configurations} -We model the temporal correlation in the synthetic data as a \emph{stochastic matrix} $P$, using a \emph{Markov chain}~\cite{gagniuc2017markov}. -$P$ is an $n \times n$ matrix, where the element $p_{ij}$ -%at the $i$th row of the $j$th column that -represents the transition probability from a state $i$ to another state $j$. -%, $\forall i, j \leq n$. -It holds that the elements of every row $i$ of $P$ sum up to $1$. -We follow the \emph{Laplacian smoothing} technique~\cite{sorkine2004laplacian} as utilized in~\cite{cao2018quantifying} to generate the matrix $P$ with a degree of temporal correlation $s>0$. -% and generate a stochastic matrix $P$ with a degree of temporal correlation $s$ by calculating each element $P_{ij}$ as follows -%$$\frac{(I_{n})_{ij} + s}{\sum_{k = 1}^{n}((I_{n})_{jk} + s)}$$ -%where $I_{n}$ is an \emph{identity matrix} of size $n$. -%, i.e.,~an $n \times n$ matrix with $1$s on its main diagonal and $0$s elsewhere. -% $s$ takes only positive values which are comparable only for stochastic matrices of the same size. -$s$ dictates the strength of the correlation; the lower its value, -%the lower the degree of uniformity of each row, and therefore -the stronger the correlation degree. 
-%In general, larger transition matrices tend to be uniform, resulting in weaker correlation. -In our experiments, for simplicity, we set $n = 2$ and we investigate the effect of \emph{weak} ($s = 1$), \emph{moderate} ($s = 0.1$), and \emph{strong} ($s = 0.01$) temporal correlation degrees on the overall privacy loss. - -We set $\varepsilon = 1$. -To perturb the spatial values of the real data sets, we inject noise that we sample from the Planar Laplace mechanism~\cite{andres2013geo}. -Finally, note that all diagrams use a logarithmic scale. - -\subsection{Experiments} - -\paragraph{Budget allocation schemes} - -Figure~\ref{fig:real} exhibits the performance of the three mechanisms: Skip, Uniform, and Adaptive. - -\begin{figure}[htp] - \centering - \subcaptionbox{Geolife\label{fig:geolife}}{% - \includegraphics[width=.5\linewidth]{geolife}% - }% - \subcaptionbox{T-drive\label{fig:t-drive}}{% - \includegraphics[width=.5\linewidth]{t-drive}% - }% - \caption{The mean absolute error (in meters) of the released data for different {\thethings} percentages.} - \label{fig:real} -\end{figure} - -For the Geolife data set (Figure~\ref{fig:geolife}), Skip has the best performance (measured in Mean Absolute Error, in meters) because it invests the most budget overall at every regular event, by approximating the {\thething} data based on previous releases. -Due to the data set's high density (every $1$--$5$ seconds or every $5$--$10$ meters per point), approximating constantly has a low impact on the data utility. -In contrast, the lower density of the T-drive data set (Figure~\ref{fig:t-drive}) has a negative impact on the performance of Skip. -In the T-drive data set, the Adaptive mechanism outperforms the Uniform one by $10$\%--$20$\% for all {\thethings} percentages greater than $0$, and the Skip one by more than $20$\%. 
-In general, we can claim that Adaptive is the best-performing mechanism if we take into consideration the drawbacks of the Skip mechanism mentioned in Section~\ref{subsec:lmdk-mechs}. Moreover, designing a data-dependent sampling scheme could yield better results for Adaptive. - - -\paragraph{Temporal distance and correlation} -Figure~\ref{fig:avg-dist} shows a comparison of the average temporal distance of the events from the previous/next {\thething} or the start/end of the time series for various distributions in synthetic data. -More particularly, we count for every event the total number of events between itself and the nearest {\thething} or the series edge. -We observe that the uniform and bimodal distributions tend to limit the regular event--{\thething} distance. -This is due to the fact that the former scatters the {\thethings}, while the latter distributes them on both edges, leaving a shorter space uninterrupted by {\thethings}. -% and as a result they reduce the uninterrupted space by landmarks in the sequence. -In contrast, distributing the {\thethings} at one part of the sequence, as in the skewed or symmetric cases, creates a wider space without {\thethings}. - -\begin{figure}[htp] - \centering - \includegraphics[width=.5\linewidth]{avg-dist}% - \caption{Average temporal distance of the events from the {\thethings} for different {\thethings} percentages within a time series in various {\thethings} distributions.} - \label{fig:avg-dist} -\end{figure} - -Figure~\ref{fig:dist-cor} illustrates a comparison among the aforementioned distributions regarding the overall privacy loss under moderate (Figure~\ref{fig:dist-cor-mod}) and strong (Figure~\ref{fig:dist-cor-stg}) correlation degrees. -The line shows the overall privacy loss---for all cases of {\thethings} distribution---without temporal correlation. -We skip the presentation of the results under a weak correlation degree, since they converge in this case. 
-In combination with Figure~\ref{fig:avg-dist}, we conclude that a greater average event--{\thething} distance in a distribution can result in a greater overall privacy loss under moderate and strong temporal correlation. -This is due to the fact that the backward/forward privacy loss accumulates more over time in wider spaces without {\thethings} (see Section~\ref{subsec:correlations}). -Furthermore, the behavior of the privacy loss is as expected regarding the temporal correlation degree. -Predictably, a stronger correlation degree generates higher privacy loss while widening the gap between the different distribution cases. -In contrast, a weaker correlation degree makes it harder to differentiate among the {\thethings} distributions. - -\begin{figure}[htp] - \centering - \subcaptionbox{Weak correlation\label{fig:dist-cor-wk}}{% - \includegraphics[width=.5\linewidth]{dist-cor-wk}% - }% - \hspace{\fill} - \subcaptionbox{Moderate correlation\label{fig:dist-cor-mod}}{% - \includegraphics[width=.5\linewidth]{dist-cor-mod}% - }% - \subcaptionbox{Strong correlation\label{fig:dist-cor-stg}}{% - \includegraphics[width=.5\linewidth]{dist-cor-stg}% - }% - \caption{Privacy loss for different {\thethings} percentages and distributions, under moderate and strong degrees of temporal correlation. - The line shows the overall privacy loss without temporal correlation.} - \label{fig:dist-cor} -\end{figure}
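
The Configurations paragraph describes a stochastic matrix $P$ generated with a correlation degree $s > 0$, where lower $s$ means stronger correlation. The following minimal Python sketch illustrates one way to build such a matrix, assuming the formula that appears in the commented-out lines of that paragraph, i.e., $p_{ij} = ((I_n)_{ij} + s) / \sum_{k}((I_n)_{ik} + s)$. The function name \texttt{transition\_matrix} is hypothetical and is not taken from the authors' code base; the sketch is only an illustration under this assumption.

```python
def transition_matrix(n, s):
    """Build an n x n row-stochastic matrix from an identity matrix
    smoothed by s (assumed formula: (I_ij + s) / sum_k (I_ik + s)).

    Hypothetical helper for illustration only. A smaller s keeps more
    probability mass on the diagonal, i.e., a stronger correlation.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    if s <= 0:
        raise ValueError("s must be strictly positive")
    matrix = []
    for i in range(n):
        # Identity entry plus the smoothing term for every column j.
        row = [(1.0 if i == j else 0.0) + s for j in range(n)]
        total = sum(row)
        # Normalize so that the row sums to 1.
        matrix.append([value / total for value in row])
    return matrix


if __name__ == "__main__":
    # Correlation degrees used in the experiments: weak, moderate, strong.
    for label, s in [("weak", 1.0), ("moderate", 0.1), ("strong", 0.01)]:
        print(label, transition_matrix(2, s))
```

For $n = 2$ and $s = 1$ the diagonal entries equal $2/3$; as $s$ decreases towards $0$, the diagonal entries approach $1$, so consecutive states become more strongly tied, which matches the weak/moderate/strong ordering stated in the text.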