\section{Evaluation}
\label{sec:the-thing-eval}

In this section, we present the experiments that we performed on real and synthetic data sets.
With the experiments on the synthetic data sets, we show the privacy loss incurred by our framework when tuning the size and statistical characteristics of the input {\thething} set $L$.
We also show how the number and distribution of the {\thethings} affect the privacy loss under temporal correlation.
With the experiments on the real data sets, we show the performance, in terms of utility, of our three {\thething} mechanisms.

Notice that, in our experiments, when $0\%$ or $100\%$ of the events are {\thethings}, we get the same behavior as in event-level or user-level privacy, respectively.
This is due to the fact that, when there are no {\thethings}, at each timestamp we take into account only the data items at the current timestamp and ignore the rest of the time series (event-level).
In contrast, when each timestamp corresponds to a {\thething}, we consider and protect all the events throughout the entire series (user-level).

\subsection{Setting, configurations, and data sets}
\paragraph{Setting}
We implemented our experiments\footnote{Code available at \url{https://gitlab.com/adhesivegoldfinch/cikm}} in Python $3$.$9$.$5$ and executed them on a machine with an Intel i$7$-$6700$HQ $3$.$5$GHz CPU and $16$GB of RAM, running Manjaro $21$.$0$.$5$.
We repeated each experiment $100$ times and report the mean over these iterations.

\paragraph{Data sets}
For the \emph{real} data sets, we used Geolife~\cite{zheng2010geolife} and T-drive~\cite{yuan2010t}, from which we sampled the first $1000$ data items.
We achieved the desired {\thethings} percentages by utilizing the method of Li et al.~\cite{li2008mining} for detecting stay points in trajectory data.
In more detail, for each data item, the algorithm checks which of the subsequent items lie within a given distance threshold $\Delta l$, and measures the time period $\Delta t$ between the present point and the last such subsequent point.
We achieve {\thethings} percentages of $0$, $20$, $40$, $60$, $80$, and $100$ by setting the ($\Delta l$ in meters, $\Delta t$ in minutes) pairs that are input to the stay point discovery method as [($0$, $1000$), ($2095$, $30$), ($2790$, $30$), ($3590$, $30$), ($4825$, $30$), ($10350$, $30$)] for T-drive and as [($0$, $100000$), ($205$, $30$), ($450$, $30$), ($725$, $30$), ($855$, $30$), ($50000$, $30$)] for Geolife.
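
The stay point discovery step described above can be sketched as follows.
This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names (\texttt{haversine\_m}, \texttt{stay\_point\_indices}), the point layout (latitude, longitude, time in seconds), and the use of the haversine formula as the distance measure are all hypothetical choices.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def stay_point_indices(points, dist_m, time_min):
    """Return the sorted indices of the points that belong to a stay.

    points   -- list of (lat, lon, t) tuples, t in seconds (assumed layout)
    dist_m   -- distance threshold (Delta l), in meters
    time_min -- time threshold (Delta t), in minutes
    """
    stays = set()
    i, n = 0, len(points)
    while i < n:
        # Extend j while the subsequent points stay within dist_m of point i.
        j = i + 1
        while j < n and haversine_m(points[i][0], points[i][1],
                                    points[j][0], points[j][1]) <= dist_m:
            j += 1
        last = j - 1  # last subsequent point that is still within range
        # Keep the run only if it spans at least time_min minutes.
        if last > i and points[last][2] - points[i][2] >= time_min * 60:
            stays.update(range(i, last + 1))
            i = j  # continue after the detected stay
        else:
            i += 1
    return sorted(stays)
```

Under these assumptions, the share of points marked as part of a stay changes with the two thresholds, which is the knob that the ($\Delta l$, $\Delta t$) pairs above turn.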

Next, we generated synthetic time series of length equal to $100$ timestamps, for which we varied the number and distribution of {\thethings}.
% to achieve the necessary {\thethings} distribution and percentage for where applicable.
% \paragraph{{\Thethings} distribution}
We created \emph{left-skewed} (the {\thethings} are distributed towards the end), \emph{symmetric} (in the middle), \emph{right-skewed} (in the beginning), \emph{bimodal} (both end and beginning), and \emph{uniform} (all over the time series) {\thething} distributions.
%, in the beginning and in the end (\emph{bimodal}), and all over the extend (\emph{uniform}) of a time series.
When pertinent, we group the left- and right-skewed cases as simply `skewed', since they share several features due to symmetry.
To obtain {\thethings} with the above distribution features, we generate probability distributions with appropriate characteristics and sample from them, without replacement, the desired number of points.
%The generated distributions are representative of the cases that we wish to examine during the experiments.
% For example, for a left-skewed {\thethings} distribution we would utilize a truncated distribution resulting from the restriction of the domain of a normal distribution to the beginning and end of the time series with its location shifted to the center of the right half of the series.
For consistency, we calculate the scale parameter depending on the length of the series by setting it equal to the series' length over a constant.
%We take into account only the temporal order of the points and the position of regular and {\thething} events within the series.
Note that, for the experiments performed on the synthetic data sets, the original values to be released do not influence our conclusions; thus, we ignore them.
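
The generation of the {\thething} positions can be sketched as follows.
The sketch is illustrative only and rests on several assumptions that the text does not fix: the weights follow a normal-shaped curve restricted to the series, the mode positions in \texttt{MODES} and the divisor \texttt{scale\_divisor} (here $10$) are hypothetical, and so are the function names.

```python
import math
import random

# Hypothetical mode positions, as fractions of the series length.
MODES = {
    "left-skewed": (0.75,),   # mass towards the end of the series
    "symmetric": (0.5,),      # mass in the middle
    "right-skewed": (0.25,),  # mass towards the beginning
    "bimodal": (0.0, 1.0),    # mass at both edges
}


def position_weights(length, kind, scale_divisor=10.0):
    """Unnormalized sampling weight of every timestamp in the series."""
    if kind == "uniform":
        return [1.0] * length
    scale = length / scale_divisor  # the scale follows the series length
    centers = [m * (length - 1) for m in MODES[kind]]
    return [
        sum(math.exp(-0.5 * ((t - c) / scale) ** 2) for c in centers)
        for t in range(length)
    ]


def sample_positions(length, count, kind, rng=None):
    """Draw `count` distinct timestamps, weighted by the chosen shape."""
    rng = rng or random.Random()
    weights = position_weights(length, kind)
    candidates = list(range(length))
    chosen = []
    for _ in range(count):
        pick = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        chosen.append(candidates.pop(pick))
        weights.pop(pick)  # drop the drawn timestamp: no replacement
    return sorted(chosen)
```

For instance, \texttt{sample\_positions(100, 20, "left-skewed")} returns $20$ distinct timestamps that concentrate in the second half of a series of $100$ timestamps.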
\paragraph{Configurations}
We model the temporal correlation in the synthetic data as a \emph{stochastic matrix} $P$, using a \emph{Markov Chain}~\cite{gagniuc2017markov}.
$P$ is an $n \times n$ matrix, where the element $p_{ij}$
%at the $i$th row of the $j$th column that
represents the transition probability from a state $i$ to another state $j$.
%, $\forall i, j \leq n$.
It holds that the elements of every row $i$ of $P$ sum up to $1$.
We follow the \emph{Laplacian smoothing} technique~\cite{sorkine2004laplacian} as utilized in~\cite{cao2018quantifying} to generate the matrix $P$ with a degree of temporal correlation $s>0$.
% and generate a stochastic matrix $P$ with a degree of temporal correlation $s$ by calculating each element $P_{ij}$ as follows
%$$\frac{(I_{n})_{ij} + s}{\sum_{k = 1}^{n}((I_{n})_{jk} + s)}$$
%where $I_{n}$ is an \emph{identity matrix} of size $n$.
%, i.e.,~an $n \times n$ matrix with $1$s on its main diagonal and $0$s elsewhere.
% $s$ takes only positive values which are comparable only for stochastic matrices of the same size.
The value of $s$ dictates the strength of the correlation; the lower its value,
%the lower the degree of uniformity of each row, and therefore
the stronger the correlation degree.
%In general, larger transition matrices tend to be uniform, resulting in weaker correlation.
In our experiments, for simplicity, we set $n = 2$ and we investigate the effect of \emph{weak} ($s = 1$), \emph{moderate} ($s = 0.1$), and \emph{strong} ($s = 0.01$) temporal correlation degree on the overall privacy loss.
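
As a worked illustration, and assuming the usual form of Laplacian smoothing applied to the identity matrix $I_n$ (a sketch of the construction, not necessarily the exact generator used in the experiments), each element of $P$ can be written as
\[
p_{ij} = \frac{(I_n)_{ij} + s}{\sum_{k = 1}^{n} \left( (I_n)_{ik} + s \right)} = \frac{(I_n)_{ij} + s}{1 + n s}.
\]
For $n = 2$, this gives
\[
P = \frac{1}{1 + 2s}
\left(\begin{array}{cc}
1 + s & s \\
s & 1 + s
\end{array}\right),
\]
so that the self-transition probability is approximately $0.667$ for $s = 1$, $0.917$ for $s = 0.1$, and $0.990$ for $s = 0.01$.
Each row sums to $1$ and, as $s$ decreases, $P$ approaches $I_2$, i.e.,~the next state is almost fully determined by the current one.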

We set $\varepsilon = 1$.
To perturb the spatial values of the real data sets, we inject noise that we sample from the Planar Laplace mechanism~\cite{andres2013geo}.
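
Sampling planar Laplace noise can be sketched as follows.
The sketch works in polar coordinates: the angle is uniform, and the radius has density $\varepsilon^{2} r e^{-\varepsilon r}$, which is a Gamma distribution with shape $2$ and scale $1/\varepsilon$.
The function names and the assumption that coordinates are already expressed in a planar system (e.g.,~in meters) are hypothetical; this is not necessarily how the experiments draw the noise.

```python
import math
import random


def planar_laplace_noise(epsilon, rng=None):
    """Draw a 2-D noise vector (dx, dy) from the planar Laplace distribution."""
    rng = rng or random.Random()
    theta = rng.uniform(0.0, 2.0 * math.pi)   # direction: uniform angle
    r = rng.gammavariate(2.0, 1.0 / epsilon)  # radius: Gamma(shape=2, scale=1/epsilon)
    return r * math.cos(theta), r * math.sin(theta)


def perturb(x, y, epsilon, rng=None):
    """Return the planar point (x, y) displaced by planar Laplace noise."""
    dx, dy = planar_laplace_noise(epsilon, rng)
    return x + dx, y + dy
```

The expected displacement is $2/\varepsilon$ in the units of the coordinates, so a smaller $\varepsilon$ moves the released location farther from the original one on average.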
Finally, note that all plots use a logarithmic scale.

\subsection{Experiments}

\paragraph{Budget allocation schemes}

Figure~\ref{fig:real} exhibits the performance of the three mechanisms: Skip, Uniform, and Adaptive.

\begin{figure}[htp]
  \centering
  \subcaptionbox{Geolife\label{fig:geolife}}{%
    \includegraphics[width=.5\linewidth]{geolife}%
  }%
  \subcaptionbox{T-drive\label{fig:t-drive}}{%
    \includegraphics[width=.5\linewidth]{t-drive}%
  }%
  \caption{The mean absolute error (in meters) of the released data for different {\thethings} percentages.}
  \label{fig:real}
\end{figure}

For the Geolife data set (Figure~\ref{fig:geolife}), Skip has the best performance (measured as the mean absolute error, in meters) because, by approximating the {\thething} data based on previous releases, it invests the most budget overall at every regular event.
Due to the data set's high density (one point every $1$--$5$ seconds or every $5$--$10$ meters), constantly approximating has a low impact on the data utility.
On the contrary, the lower density of the T-drive data set (Figure~\ref{fig:t-drive}) has a negative impact on the performance of Skip.
In the T-drive data set, the Adaptive mechanism outperforms Uniform by $10$\%--$20$\% for all {\thethings} percentages greater than $0$, and Skip by more than $20$\%.
In general, we can claim that Adaptive is the best performing mechanism, if we take into consideration the drawbacks of the Skip mechanism mentioned in Section~\ref{subsec:lmdk-mechs}.
Moreover, designing a data-dependent sampling scheme could further improve the results of Adaptive.

\paragraph{Temporal distance and correlation}
Figure~\ref{fig:avg-dist} compares, for various distributions in synthetic data, the average temporal distance of the events from the previous/next {\thething} or the start/end of the time series.
More specifically, we count for every event the total number of events between itself and the nearest {\thething} or the series edge.
We observe that the uniform and bimodal distributions tend to limit the regular event--{\thething} distance.
This is due to the fact that the former scatters the {\thethings}, while the latter distributes them on both edges, leaving a shorter space uninterrupted by {\thethings}.
% and as a result they reduce the uninterrupted space by landmarks in the sequence.
On the contrary, distributing the {\thethings} at one part of the sequence, as in the skewed or symmetric cases, creates a wider space without {\thethings}.

\begin{figure}[htp]
  \centering
  \includegraphics[width=.5\linewidth]{avg-dist}%
  \caption{Average temporal distance of the events from the {\thethings} for different {\thethings} percentages within a time series in various {\thethings} distributions.}
  \label{fig:avg-dist}
\end{figure}

Figure~\ref{fig:dist-cor} compares the aforementioned distributions regarding the overall privacy loss under moderate (Figure~\ref{fig:dist-cor-mod}) and strong (Figure~\ref{fig:dist-cor-stg}) correlation degrees.
The line shows the overall privacy loss, for all cases of {\thethings} distribution, without temporal correlation.
We skip the presentation of the results under a weak correlation degree, since they converge in this case.
In combination with Figure~\ref{fig:avg-dist}, we conclude that a greater average event--{\thething} distance in a distribution can result in greater overall privacy loss under moderate and strong temporal correlation.
This is due to the fact that the backward/forward privacy loss accumulates more over time in wider spaces without {\thethings} (see Section~\ref{subsec:correlations}).
Furthermore, the privacy loss behaves as expected with respect to the temporal correlation degree: a stronger correlation degree generates higher privacy loss while widening the gap between the different distribution cases.
On the contrary, a weaker correlation degree makes it harder to differentiate among the {\thethings} distributions.

\begin{figure}[htp]
  \centering
  \subcaptionbox{Weak correlation\label{fig:dist-cor-wk}}{%
    \includegraphics[width=.5\linewidth]{dist-cor-wk}%
  }%
  \hspace{\fill}
  \subcaptionbox{Moderate correlation\label{fig:dist-cor-mod}}{%
    \includegraphics[width=.5\linewidth]{dist-cor-mod}%
  }%
  \subcaptionbox{Strong correlation\label{fig:dist-cor-stg}}{%
    \includegraphics[width=.5\linewidth]{dist-cor-stg}%
  }%
  \caption{Privacy loss for different {\thethings} percentages and distributions, under weak, moderate, and strong degrees of temporal correlation.
  The line shows the overall privacy loss without temporal correlation.}
  \label{fig:dist-cor}
\end{figure}