
\section{{\Thething} events}
\label{sec:eval-lmdk}
In this section, we present the experiments that we performed to test the methodology of Section~\ref{subsec:lmdk-sol} on real and synthetic data sets.
With the experiments on the real data sets (Section~\ref{subsec:lmdk-expt-bgt}), we show the performance, in terms of data utility, of our three {\thething} privacy mechanisms: Skip, Uniform, and Adaptive.
We define data utility as the mean absolute error introduced by the privacy mechanism.
We compare with the event- and user-level differential privacy protection levels, and show that, in the general case, {\thething} privacy allows for better data utility than user-level differential privacy while balancing between the two protection levels.
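The utility metric can be sketched in a few lines; the following is a minimal, self-contained computation of the mean absolute error between the original and the released time series (the function name is ours, for illustration):

```python
# Minimal sketch: data utility measured as the mean absolute error (MAE)
# between the original and the privacy-protected (released) time series.
def mean_absolute_error(original, released):
    """Average absolute deviation introduced by the privacy mechanism."""
    assert len(original) == len(released)
    return sum(abs(o - r) for o, r in zip(original, released)) / len(original)
```

A lower MAE means the perturbed releases stay closer to the true values, i.e., better data utility.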
With the experiments on the synthetic data sets (Section~\ref{subsec:lmdk-expt-cor}), we show the temporal privacy loss, i.e.,~the privacy budget $\varepsilon$ plus the extra privacy loss due to temporal correlation, within our framework when tuning the size and statistical characteristics of the input {\thething} set $L$.
We observe that a greater average {\thething}--regular event distance in a time series can result in greater temporal privacy loss under moderate and strong temporal correlation.
\subsection{Budget allocation schemes}
\label{subsec:lmdk-expt-bgt}
Figure~\ref{fig:real} exhibits the performance of the three mechanisms, Skip, Uniform, and Adaptive, applied on the three data sets that we study.
Notice that, in the cases where $0\%$ or $100\%$ of the events are {\thethings}, we get the same behavior as in event- and user-level privacy, respectively.
This happens because, when there are no {\thethings}, at each timestamp we take into account only the data items of the current timestamp and ignore the rest of the time series (event-level), whereas, when every timestamp corresponds to a {\thething}, we protect all the events throughout the entire series (user-level).
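To make the two extremes concrete, the following sketch splits the privacy budget in the spirit of the Uniform scheme; the exact allocation in our mechanisms differs per scheme, so treat the function name and the split rule as illustrative assumptions:

```python
# Hedged sketch: split a total budget epsilon among the {thething} set L
# plus the current regular event at every timestamp.
def uniform_budget(epsilon, timestamps, landmarks):
    allocation = {}
    for t in timestamps:
        # A landmark timestamp is already counted inside L.
        k = len(landmarks) + (0 if t in landmarks else 1)
        allocation[t] = epsilon / max(k, 1)
    return allocation
```

With an empty landmark set, every timestamp receives the full budget $\varepsilon$ (event-level); with every timestamp in $L$, each receives $\varepsilon/|L|$ (user-level).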
\begin{figure}[htp]
\centering
\subcaptionbox{Copenhagen\label{fig:copenhagen}}{%
\includegraphics[width=.49\linewidth]{evaluation/copenhagen}%
}%
\\ \bigskip
\subcaptionbox{HUE\label{fig:hue}}{%
\includegraphics[width=.49\linewidth]{evaluation/hue}%
}%
\hfill
\subcaptionbox{T-drive\label{fig:t-drive}}{%
\includegraphics[width=.49\linewidth]{evaluation/t-drive}%
}%
\caption{The mean absolute error (a)~as a percentage, (b)~in kWh, and (c)~in meters of the released data for different {\thething} percentages.}
\label{fig:real}
\end{figure}
For the Copenhagen data set (Figure~\ref{fig:copenhagen}), Adaptive has an overall consistent performance and works best for $60$\% and $80$\% {\thethings}.
We notice that, for $0$\% {\thethings}, it achieves better utility than the event-level protection due to the combination of more available privacy budget per timestamp (owing to the absence of {\thethings}) and its adaptive sampling methodology.
Skip outperforms the other mechanisms in the cases where it needs to approximate $20$\%, $40$\%, or $100$\% of the timestamps.
In general, we notice that, for this data set and due to the application of the randomized response technique, it is more beneficial to either invest more privacy budget per event or prefer approximation over introducing randomization.
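For reference, binary randomized response keeps the true bit with probability $e^\varepsilon/(e^\varepsilon + 1)$; the following sketch (parameter names are ours) illustrates why small per-event budgets make randomization costly:

```python
import math
import random

# Hedged sketch of binary randomized response: keep the true bit with
# probability e^eps / (e^eps + 1), flip it otherwise.  For a small eps
# this probability approaches 1/2 and the output is nearly random, which
# is why concentrating budget or approximating pays off on this data set.
def randomized_response(bit, eps, rng=random.random):
    p_truth = math.exp(eps) / (math.exp(eps) + 1.0)
    return bit if rng() < p_truth else 1 - bit
```

With a large budget (e.g., $\varepsilon = 10$) the mechanism almost always reports the truth; with $\varepsilon = 0.1$ it flips the bit close to half of the time.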
The combination of the small range of measurements ($[0.28, 4.45]$ with an average of $0.88$kWh) in HUE (Figure~\ref{fig:hue}) and the large scale of the Laplace mechanism allows mechanisms that favor approximation over noise injection to achieve better data utility.
Hence, Skip achieves a consistently low mean absolute error.
Regardless, the Adaptive mechanism performs considerably better than Uniform and balances between event- and user-level protection for all {\thething} percentages.
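The interplay between the measurement range and the noise scale can be seen from a plain Laplace mechanism sketch (inverse-CDF sampling; the function is illustrative, not our implementation):

```python
import math
import random

# Hedged sketch of the Laplace mechanism: add noise with scale
# sensitivity / eps.  When this scale is comparable to HUE's value range
# (roughly [0.28, 4.45] kWh), the noise can dwarf the signal, so schemes
# that approximate instead of perturbing (Skip) keep the error low.
def laplace_mechanism(value, sensitivity, eps, rng=random):
    u = rng.random() - 0.5  # u in (-0.5, 0.5)
    scale = sensitivity / eps
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return value + noise
```

For instance, with sensitivity $1$ and $\varepsilon = 1$, the noise has standard deviation $\sqrt{2} \approx 1.41$kWh, larger than HUE's average measurement.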
In T-drive (Figure~\ref{fig:t-drive}), the Adaptive mechanism outperforms Uniform by $10$\%--$20$\% for all {\thething} percentages greater than $40$\%, and Skip by more than $20$\%.
The lower density (average distance of $623$m between consecutive locations) of the T-drive data set has a negative impact on the performance of Skip, because republishing a previous perturbed value is now less accurate than perturbing the current location.
Principally, we can claim that Adaptive is the most reliable and best-performing mechanism, especially if we take into consideration the drawbacks of the Skip mechanism in spatiotemporal data, e.g., sporadic location data publishing~\cite{gambs2010show, russell2018fitness} or misapplied location cloaking~\cite{xssfopes2020tweet}, which could reveal privacy-sensitive attribute values.
Moreover, implementing a more advanced, data-dependent sampling method that accounts for changes in the trends of the input data and adapts its rate accordingly would result in a more effective budget allocation and would further improve the data utility of Adaptive.
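Such a data-dependent rate could, for instance, follow a multiplicative increase/decrease rule; a toy sketch under that assumption (the names and the rule itself are illustrative, not part of our mechanisms):

```python
# Hedged sketch: adapt the sampling interval to the observed trend.
# Sample more frequently while the approximation error is high (the data
# trend is changing), and back off while consecutive releases stay similar.
def next_interval(current_interval, last_error, threshold, factor=2.0):
    if last_error > threshold:
        return max(1, round(current_interval / factor))
    return round(current_interval * factor)
```

Sampling more often during changes spends budget where approximation would fail, while longer intervals during stable periods save budget for later timestamps.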
\subsection{Temporal distance and correlation}
\label{subsec:lmdk-expt-cor}
As previously mentioned, temporal correlation is inherent in continuous publishing, and it is the cause of supplementary privacy loss in the case of privacy-preserving time series publishing.
In this section, we study the effect that the distance of the {\thethings} from every regular event has on the privacy loss incurred under the presence of temporal correlation.
Figure~\ref{fig:avg-dist} shows a comparison of the average temporal distance of the events from the previous/next {\thething} or the start/end of the time series for various distributions in our synthetic data.
More specifically, we model the distance of an event as the number of events between itself and the nearest {\thething} or time series edge.
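Under this model, the distance computation can be sketched as follows (the edges are treated as virtual {\thethings}; the function names are ours):

```python
# Minimal sketch of the distance model: the distance of the event at
# timestamp t is the number of events between it and the nearest
# landmark or time-series edge (positions -1 and n act as virtual edges).
def event_distance(t, n, landmarks):
    anchors = set(landmarks) | {-1, n}
    return min(abs(t - a) for a in anchors) - 1

def avg_regular_distance(n, landmarks):
    """Average distance over the regular (non-landmark) events."""
    regular = [t for t in range(n) if t not in set(landmarks)]
    if not regular:
        return 0.0
    return sum(event_distance(t, n, landmarks) for t in regular) / len(regular)
```

For example, in a series of $5$ events with a single {\thething} at position $2$, the regular event at position $4$ has distance $0$: no events lie between it and the series edge.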
\begin{figure}[htp]
\centering
\includegraphics[width=.5\linewidth]{evaluation/avg-dist}%
\caption{Average temporal distance of regular events from the {\thethings} for different {\thething} percentages within a time series, for various {\thething} distributions.}
\label{fig:avg-dist}
\end{figure}
We observe that the uniform and bimodal distributions tend to limit the regular event--{\thething} distance.
This is due to the fact that the former scatters the {\thethings} throughout the sequence, while the latter distributes them on both edges, leaving a shorter space uninterrupted by {\thethings}.
On the contrary, distributing the {\thethings} at one part of the sequence, as in the skewed or symmetric distributions, creates a wider space without {\thethings}.
This study provides us with different distance settings that we use in the subsequent temporal privacy loss study.
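The distribution settings themselves can be generated with a simple sampler; the region boundaries below are our own illustrative choices, not the exact generator used for the synthetic data:

```python
import random

# Hedged sketch: draw k landmark timestamps out of n according to the
# four distribution shapes compared in the text.
def sample_landmarks(n, k, shape, rng=random.Random(0)):
    if shape == "uniform":    # scattered over the whole series
        return sorted(rng.sample(range(n), k))
    if shape == "bimodal":    # clustered at both edges
        half = k // 2
        return sorted(rng.sample(range(n // 4), half)
                      + rng.sample(range(3 * n // 4, n), k - half))
    if shape == "skewed":     # clustered at one edge
        return sorted(rng.sample(range(n // 2), k))
    if shape == "symmetric":  # clustered around the middle
        return sorted(rng.sample(range(n // 4, 3 * n // 4), k))
    raise ValueError(shape)
```

The uniform and bimodal shapes leave only short uninterrupted stretches, whereas the skewed and symmetric shapes leave roughly half of the series free of {\thethings}.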
Figure~\ref{fig:dist-cor} illustrates a comparison among the aforementioned distributions regarding the temporal privacy loss under (a)~weak, (b)~moderate, and (c)~strong temporal correlation degrees.
The line shows the overall privacy loss---for all cases of {\thething} distribution---without temporal correlation.
\begin{figure}[htp]
\centering
\subcaptionbox{Weak correlation\label{fig:dist-cor-wk}}{%
\includegraphics[width=.49\linewidth]{evaluation/dist-cor-wk}%
}%
\hfill
\\ \bigskip
\subcaptionbox{Moderate correlation\label{fig:dist-cor-mod}}{%
\includegraphics[width=.49\linewidth]{evaluation/dist-cor-mod}%
}%
\hfill
\subcaptionbox{Strong correlation\label{fig:dist-cor-stg}}{%
\includegraphics[width=.49\linewidth]{evaluation/dist-cor-stg}%
}%
\caption{
The temporal privacy loss (privacy budget $\varepsilon$)
for different {\thething} percentages and distributions under (a)~weak, (b)~moderate, and (c)~strong degrees of temporal correlation.
The line shows the overall privacy loss without temporal correlation.
}
\label{fig:dist-cor}
\end{figure}
In combination with Figure~\ref{fig:avg-dist}, we conclude that a greater average {\thething}--regular event distance in a distribution can result in greater temporal privacy loss under moderate and strong temporal correlation.
This is due to the fact that the backward/forward privacy loss accumulates more over time in wider spaces without {\thethings} (see Section~\ref{sec:correlation}).
Furthermore, the behavior of the privacy loss is as expected regarding the temporal correlation degree: a stronger correlation degree generates higher privacy loss while widening the gap between the different distribution cases.
On the contrary, a weaker correlation degree makes it harder to differentiate among the {\thething} distributions.
Under a weak correlation degree, the privacy loss converges to similar values for all {\thething} distributions and percentages.