diff --git a/text/evaluation/details.tex b/text/evaluation/details.tex index 5ae5a58..fae9afe 100644 --- a/text/evaluation/details.tex +++ b/text/evaluation/details.tex @@ -53,7 +53,7 @@ We sampled the first $1000$ data items of the taxi with identifier `$2$', which We generated synthetic time series of length equal to $100$ timestamps, for which we varied the number and distribution of {\thethings}. In this way, we have a controlled data set that we can use to study the behavior of our proposal. % \kat{more details needed. eg. what is the distributions and number of timestamps used? How many time series you generated? } -We take into account only the temporal order of the points and the position of regular and {\thething} events within the series. +We take into account only the temporal order of the points and the position of regular and {\thething} events within the time series. In Section~\ref{subsec:eval-conf}, we explain in more detail our configuration criteria. % \kat{why is the value not important? at the energy consumption, they mattered} @@ -61,14 +61,14 @@ In Section~\ref{subsec:eval-conf}, we explain in more detail our configuration c \subsection{Configurations} \label{subsec:eval-conf} % \kat{add some info here.. what are the configurations for? What does landmark percentage refer to, and how does it matter? } -We vary the {\thething} percentage (Section~\ref{subsec:eval-conf-lmdk}), i.e.,~the ratio of timestamps that we attribute to {\thethings} and regular events, in order to identify the limitations of our methodology. +We vary the {\thething} percentage (Section~\ref{subsec:eval-conf-lmdk}), i.e.,~the ratio of timestamps that we attribute to {\thethings} and regular events, in order to explore the behavior of our methodology in all possible scenarios. 
For each data set, we implement a privacy mechanism that injects noise related to the type of its attribute values and we tune the parameters of each mechanism accordingly (Section~\ref{subsec:eval-conf-prv}). -Last but not least, we explain how we generate synthetic data sets with the desired degree of temporal correlation (Section~\ref{subsec:eval-conf-cor}). +Last but not least, we explain how we generate synthetic data sets with various degrees of temporal correlation so as to observe the impact on the overall privacy loss (Section~\ref{subsec:eval-conf-cor}). \subsubsection{{\Thething} percentage} \label{subsec:eval-conf-lmdk} -In the Copenhagen data set, a {\thething} represents a timestamp when a contact device is registered. +In the Copenhagen data set, a {\thething} represents a timestamp when a specific contact device is registered. After identifying the unique contacts within the sample, we achieve each desired {\thethings} to regular events ratio by considering a list that contains a part of these contact devices. In more detail, we achieve $0\%$ {\thethings} by considering an empty list of contact devices, @@ -94,23 +94,23 @@ After analyzing the data and experimenting with different pairs of distance and We generated synthetic data with \emph{skewed} (the {\thethings} are distributed towards the beginning/end of the series), \emph{symmetric} (in the middle), \emph{bimodal} (both end and beginning), and \emph{uniform} (all over the time series) {\thething} distributions. In order to get {\thething} sets with the above distribution features, we generate probability distributions with restricted domain to the beginning and end of the time series, and sample from them, without replacement, the desired number of points. For each case, we place the location, i.e.,~centre, of the distribution accordingly. -That is, for a symmetric we put the location in the middle of the time series and for a left/right skewed to the right/left. 
-For the bimodal we combine two mirrored skewed distributions. +That is, for symmetric we put the location in the middle of the time series and for left/right skewed to the right/left. +For bimodal we combine two mirrored skewed distributions. Finally, for the uniform distribution we distribute the {\thethings} randomly throughout the time series. -For consistency, we calculate the scale parameter depending on the length of the series by setting it equal to the series' length over a constant. +For consistency, we calculate the scale parameter of the corresponding distribution depending on the length of the time series by setting it equal to the series' length over a constant. \subsubsection{Privacy parameters} \label{subsec:eval-conf-prv} % \kat{Explain why you select each of these perturbation mechanisms for each of the datasets. Is the random response differential private? Mention it! } -For all of te real data sets, we implement $\varepsilon$-differential privacy. +For all of the real data sets, we implement $\varepsilon$-differential privacy by selecting a mechanism, from those that we described in Section~\ref{subsec:prv-mech}, that is best suited for the type of its sensitive attributes. To perturb the contact tracing data of the Copenhagen data set, we utilize the \emph{random response} technique~\cite{wang2017locally}, and at each timestamp we report truthfully, with probability $p = \frac{e^\varepsilon}{e^\varepsilon + 1}$, whether the current contact is a {\thething} or not. -We randomize the energy consumption in HUE with the Laplace mechanism (described in detail in Section~\ref{subsec:prv-mech}). -For T-drive, we perturb the location data wit noise that we sample from the Planar Laplace mechanism~\cite{andres2013geo}. +We randomize the energy consumption in HUE with the Laplace mechanism~\cite{dwork2014algorithmic}. +For T-drive, we perturb the location data with noise that we sample from the Planar Laplace mechanism~\cite{andres2013geo}. 
We set the privacy budget $\varepsilon = 1$ for all of our experiments and, for simplicity, we assume that for every query sensitivity it holds that $\Delta f = 1$. % \kat{why don't you consider other values as well?} -For the experiments performed on the synthetic data sets, the original values to be released do not influence the outcome of our conclusions, thus we ignore them. +For the experiments that we performed on the synthetic data sets, the original values to be released are not relevant to what we want to observe, and thus we ignore them. % \kat{why are the values not important for the synthetic dataset? This seems a little weird, when said out of context.. our goal is to perturb the values, but do not really care about the way we perturb our values?} % Finally, notice that, depending on the results' variation, most diagrams are in logarithmic scale. @@ -121,12 +121,12 @@ For the experiments performed on the synthetic data sets, the original values to Despite the inherent presence of temporal correlation in time series, it is challenging to correctly discover and quantify it. For this reason, and in order to create a more controlled environment for our experiments, we chose to conduct tests relevant to temporal correlation using synthetic data sets. We model the temporal correlation in the synthetic data as a \emph{stochastic matrix} $P$, using a \emph{Markov Chain}~\cite{gagniuc2017markov}. -$P$ is a $n \times n$ matrix, where the element $P_{ij}$ +$P$ is an $n \times n$ matrix, where the element $P_{ij}$ %at the $i$th row of the $j$th column that represents the transition probability from a state $i$ to another state $j$. %, $\forall i, j \leq n$. It holds that the elements of every row $j$ of $P$ sum up to $1$. 
-We follow the \emph{Laplacian smoothing} technique~\cite{sorkine2004laplacian} as utilized in~\cite{cao2018quantifying} to generate the matrix $P$ with a degree of temporal correlation $s > 0$ equal to +We follow the \emph{Laplacian smoothing} technique~\cite{sorkine2004laplacian}, as utilized in~\cite{cao2018quantifying}, to generate the matrix $P$ with a degree of temporal correlation $s > 0$ equal to % and generate a stochastic matrix $P$ with a degree of temporal correlation $s$ by calculating each element $P_{ij}$ as follows $$\frac{(I_{n})_{ij} + s}{\sum_{k = 1}^{n}((I_{n})_{jk} + s)}$$ where $I_{n}$ is an \emph{identity matrix} of size $n$. diff --git a/text/evaluation/main.tex b/text/evaluation/main.tex index fae6cd1..7913412 100644 --- a/text/evaluation/main.tex +++ b/text/evaluation/main.tex @@ -3,7 +3,7 @@ In this chapter we present the experiments that we performed in order to evaluate {\thething} privacy (Chapter~\ref{ch:lmdk-prv}) on real and synthetic data sets. Section~\ref{sec:eval-dtl} contains all the details regarding the data sets the we used for our experiments along with the system configurations. Section~\ref{sec:eval-lmdk} evaluates the data utility of the {\thething} privacy mechanisms that we designed in Section~\ref{sec:thething} and investigates the behavior of the privacy loss under temporal correlation for different distributions of {\thethings}. -Section~\ref{sec:eval-lmdk-sel} justifies our decisions while designing the privacy-preserving {\thething} selection component in Section~\ref{sec:theotherthing} and the data utility impact of the latter. +Section~\ref{sec:eval-lmdk-sel} justifies our decisions while designing the privacy-preserving {\thething} selection mechanism in Section~\ref{sec:theotherthing} and the data utility impact of the latter. Finally, Section~\ref{sec:eval-sum} concludes this chapter by summarizing the main results derived from the experiments. 
\input{evaluation/details} diff --git a/text/evaluation/theotherthing.tex b/text/evaluation/theotherthing.tex index b21c15c..90cb20c 100644 --- a/text/evaluation/theotherthing.tex +++ b/text/evaluation/theotherthing.tex @@ -1,11 +1,11 @@ \section{Selection of landmarks} \label{sec:eval-lmdk-sel} -In this section, we present the experiments on the methodology for the {\thethings} selection presented in Section~\ref{subsec:lmdk-sel-sol}, on the real and synthetic data sets. +In this section, we present the experiments on the methodology for the {\thething} selection presented in Section~\ref{subsec:lmdk-sel-sol}, on the real and synthetic data sets. With the experiments on the synthetic data sets (Section~\ref{subsec:sel-utl}) we show the normalized Euclidean and Wasserstein distance metrics (not to be confused with the temporal distances in Figure~\ref{fig:avg-dist}) % \kat{is this distance the landmark distance that we saw just before ? clarify } of the time series histograms for various distributions and {\thething} percentages. This allows us to justify our design decisions for our concept that we showcased in Section~\ref{subsec:lmdk-sel-sol}. -With the experiments on the real data sets (Section~\ref{subsec:sel-prv}), we show the performance in terms of utility of our three {\thething} mechanisms in combination with the privacy-preserving {\thething} selection mechanism, which enhances the privacy protection of our concept. +With the experiments on the real data sets (Section~\ref{subsec:sel-prv}), we show the performance in terms of utility of our three {\thething} mechanisms in combination with the privacy-preserving {\thething} selection mechanism, which enhances the privacy protection that our concept provides. 
% \kat{Mention whether it improves the original proposal or not.} @@ -37,7 +37,7 @@ Thus, we choose to utilize the Euclidean distance metric for the implementation \subsection{Privacy budget tuning} \label{subsec:sel-eps} -In Figure~\ref{fig:sel-eps} we test the Uniform mechanism in real data by investing different ratios ($1$\%, $10$\%, $25$\%, and $50$\%) of the available privacy budget $\varepsilon$ in the {\thething} selection mechanism and the remaining to perturbing the data values, in order to figure out the optimal ratio value. +In Figure~\ref{fig:sel-eps} we test the Uniform mechanism in real data by investing different ratios ($1$\%, $10$\%, $25$\%, and $50$\%) of the available privacy budget $\varepsilon$ in the {\thething} selection mechanism and the remaining to perturbing the original data values, in order to figure out the optimal ratio value. Uniform is our baseline implementation, and hence allows us to derive more accurate conclusions in this case. In general, we are expecting to observe that greater ratios will result in more accurate, i.e.,~smaller, {\thething} sets and less accurate values in the released data. @@ -59,8 +59,8 @@ In general, we are expecting to observe that greater ratios will result in more \label{fig:sel-eps} \end{figure} -The application of the randomized response mechanism, in the Copenhagen data set, is tolerant to the fluctuations of the privacy budget and maintains a relatively constant performance in terms of data utility. -For HUE and T-drive, we observe that our implementation performs better for lower ratios, e.g.,~$0.01$, where we end up allocating the majority of the available privacy budget to the data release process instead of the {\thething} selection mechanism. +The application of the randomized response mechanism, in the Copenhagen data set (Figure~\ref{fig:copenhagen-sel-eps}), is tolerant to the fluctuations of the privacy budget and maintains a relatively constant performance in terms of data utility. 
+For HUE (Figure~\ref{fig:hue-sel-eps}) and T-drive (Figure~\ref{fig:t-drive-sel-eps}), we observe that our implementation performs better for lower ratios, e.g.,~$0.01$, where we end up allocating the majority of the available privacy budget to the data release process instead of the {\thething} selection mechanism. The results of this experiment indicate that we can safely allocate the majority of $\varepsilon$ for publishing the data values, and therefore achieve better data utility, while providing more robust privacy protection to the {\thething} set. @@ -94,7 +94,7 @@ This is natural since we allocated part of the available privacy budget to the p Therefore, there is less privacy budget available for data publishing throughout the time series. % for $0$\% and $100$\% {\thethings}. % \kat{why not for the other percentages?} -Skip performs best in our experiments with HUE, due to the low range in the energy consumption and the high scale of the Laplace noise that it avoids due to the employed approximation. -However, for the Copenhagen data set and T-drive, Skip attains greater mean absolute error than the user-level protection scheme, which exposes no benefit with respect to user-level protection. -Overall, Adaptive has a consistent performance in terms of utility for all of the data sets that we experimented with, and almost always outperforms the user-level privacy. +Skip performs best in our experiments with HUE (Figure~\ref{fig:hue-sel}), due to the low range in the energy consumption and the high scale of the Laplace noise that it avoids due to the employed approximation. +However, for the Copenhagen data set (Figure~\ref{fig:copenhagen-sel}) and T-drive (Figure~\ref{fig:t-drive-sel}), Skip attains high mean absolute error, which exposes no benefit with respect to user-level protection. 
+Overall, Adaptive has a consistent performance in terms of utility for all of the data sets that we experimented with, and almost always outperforms the user-level privacy protection. Thus, it is selected as the best mechanism to use in general. diff --git a/text/evaluation/thething.tex b/text/evaluation/thething.tex index 2bbb11f..3fd11e7 100644 --- a/text/evaluation/thething.tex +++ b/text/evaluation/thething.tex @@ -10,7 +10,7 @@ We compare with the event- and user-level differential privacy protection levels With the experiments on the synthetic data sets (Section~\ref{subsec:lmdk-expt-cor}) we show the overall privacy loss, % \kat{in the previous set of experiments we were measuring the MAE, now we are measuring the privacy loss... Why is that? Isn't it two sides of the same coin? } -i.e.,~the privacy budget $\varepsilon$, under temporal correlation within our framework when tuning the size and statistical characteristics of the input {\thething} set $L$. +i.e.,~the privacy budget $\varepsilon$ with the extra privacy loss because of the temporal correlation, under temporal correlation within our framework when tuning the size and statistical characteristics of the input {\thething} set $L$. % \kat{mention briefly what you observe} We observe that a greater average {\thething}--regular event distance in a time series can result into greater overall privacy loss under moderate and strong temporal correlation. @@ -52,19 +52,19 @@ overall consistent performance and works best for $60$\% and $80$\% {\thethings} We notice that for $0$\% {\thethings}, it achieves better utility than the event-level protection % \kat{what does this mean? how is it possible?} due to the combination of more available privacy budget per timestamp (due to the absence of {\thethings}) and its adaptive sampling methodology. -The Skip model excels, compared to the others, at cases where it needs to approximate $20$\%, $40$\%, or $100$\% of the times. 
+Skip excels, compared to the others, at cases where it needs to approximate $20$\%, $40$\%, or $100$\% of the times. % \kat{it seems a little random.. do you have an explanation? (rather few times or all?)} -In general, we notice that, for this data set, it is more beneficial to either invest more privacy budget per event or prefer approximation over introducing randomization. +In general, we notice that, for this data set and due to the application of the random response technique, it is more beneficial to either invest more privacy budget per event or prefer approximation over introducing randomization. -The combination of the small range of measurements in HUE ($[0.28$, $4.45]$ with an average of $0.88$kWh) and the large scale in the Laplace mechanism, allows for schemes that favor approximation over noise injection to achieve a better performance in terms of data utility. -Hence, Skip (Figure~\ref{fig:hue}) achieves a constant low mean absolute error. +The combination of the small range of measurements ($[0.28$, $4.45]$ with an average of $0.88$kWh) in HUE (Figure~\ref{fig:hue}) and the large scale in the Laplace mechanism, allows for mechanisms that favor approximation over noise injection to achieve a better performance in terms of data utility. +Hence, Skip achieves a constant low mean absolute error. % \kat{why?explain} Regardless, the Adaptive mechanism performs by far better than Uniform and % strikes a nice balance\kat{???} balances between event- and user-level protection for all {\thething} percentages. -In the T-drive data set (Figure~\ref{fig:t-drive}), the Adaptive mechanism outperforms Uniform by $10$\%--$20$\% for all {\thething} percentages greater than $40$\% and Skip by more than $20$\%. -The lower density (average distance of $623$m) of the T-drive data set has a negative impact on the performance of Skip; republishing a previous perturbed value is now less accurate than perturbing the new location. 
+In T-drive (Figure~\ref{fig:t-drive}), the Adaptive mechanism outperforms Uniform by $10$\%--$20$\% for all {\thething} percentages greater than $40$\% and Skip by more than $20$\%. +The lower density (average distance of $623$m) of the T-drive data set has a negative impact on the performance of Skip because republishing a previous perturbed value is now less accurate than perturbing the current location. Principally, we can claim that the Adaptive is the most reliable and best performing mechanism, % with a minimal and generic parameter tuning @@ -83,11 +83,11 @@ result in a more effective budget allocation that would improve the performance \subsection{Temporal distance and correlation} \label{subsec:lmdk-expt-cor} -As previously mentioned, temporal correlation is inherent in continuous publishing, and they are the cause of supplementary privacy loss in the case of privacy-preserving data publication. -In this section, we are interested in studying the effect that the distance of the {\thethings} from every event have on the loss caused by temporal correlation. +As previously mentioned, temporal correlation is inherent in continuous publishing, and it is the cause of supplementary privacy loss in the case of privacy-preserving time series publishing. +In this section, we are interested in studying the effect that the distance of the {\thethings} from every regular event has on the loss caused under the presence of temporal correlation. Figure~\ref{fig:avg-dist} shows a comparison of the average temporal distance of the events from the previous/next {\thething} or the start/end of the time series for various distributions in our synthetic data. -More specifically, we model the distance of an event as the count of the total number of events between itself and the nearest {\thething} or the series edge. 
+More specifically, we model the distance of an event as the count of the total number of events between itself and the nearest {\thething} or the time series edge. \begin{figure}[htp] \centering @@ -100,7 +100,7 @@ We observe that the uniform and bimodal distributions tend to limit the regular This is due to the fact that the former scatters the {\thethings}, while the latter distributes them on both edges, leaving a shorter space uninterrupted by {\thethings}. % and as a result they reduce the uninterrupted space by landmarks in the sequence. On the contrary, distributing the {\thethings} at one part of the sequence, as in skewed or symmetric, creates a wider space without {\thethings}. -This study provides us with different distance settings that we are going to use in the subsequent temporal leakage study. +This study provides us with different distance settings that we are going to use in the subsequent overall privacy loss study. Figure~\ref{fig:dist-cor} illustrates a comparison among the aforementioned distributions regarding the overall privacy loss under (a)~weak, (b)~moderate, and (c)~strong temporal correlation degrees. The line shows the overall privacy loss---for all cases of {\thething} distribution---without temporal correlation.