katerinatzo 2021-10-12 15:42:27 +02:00
parent f71422771e
commit e6bd6c1b7b
3 changed files with 38 additions and 23 deletions


@ -1,70 +1,82 @@
\section{Details}
\section{Experimental Setting and Data Sets}
\label{sec:eval-dtl}
In this section we list all the relevant details regarding the setting of the evaluation (Section~\ref{subsec:eval-setup}), and the real and synthetic data sets that we used(Section~\ref{subsec:eval-dat}), along with the corresponding configurations (Section~\ref{subsec:eval-conf}).
In this section we list all the relevant details regarding the evaluation setting (Section~\ref{subsec:eval-setup}), and we present the real and synthetic data sets that we used (Section~\ref{subsec:eval-dat}), along with the corresponding configurations (Section~\ref{subsec:eval-conf}).
\subsection{Setting}
\label{subsec:eval-setup}
We implemented our experiments\footnote{Code available at \url{https://git.delkappa.com/manos/the-last-thing}} in Python $3$.$9$.$7$ and executed them on a machine with an Intel i$7$-$6700$HQ at $3$.$5$GHz CPU and $16$GB RAM, running Manjaro Linux $21$.$1$.$5$.
We repeated each experiment $100$ times and we report the mean over these iterations.
We repeated each experiment $100$ times and we report the mean over these iterations. \kat{It would be interesting to also report the std on the diagrams.}
\subsection{Data sets}
\label{subsec:eval-dat}
\subsubsection{Real}
\subsubsection{Real Data Sets}
\paragraph{Copenhagen}~\cite{sapiezynski2019interaction}
data set that was collected via the smartphone devices of $851$ university students over a period of $4$ week as part of the Copenhagen Networks Study.
data set was collected via the smartphone devices of $851$ university students over a period of $4$ weeks as part of the Copenhagen Networks Study.
Each device was configured to be discoverable by and to discover nearby Bluetooth devices every $5$ minutes.
Upon discovery each device registers, (i)~the timestamp in seconds, (ii)~the device's unique identifier, (iii)~the unique identifier of the device that it discovered ($- 1$ when no device was found or $- 2$ for any non-participating device), and (iv)~the Received Signal Strength Indicator (RSSI) in dBm.
Upon discovery, each device registers (i)~the timestamp in seconds, (ii)~the device's unique identifier, (iii)~the unique identifier of the device that it discovered ($- 1$ when no device was found or $- 2$ for any non-participating device), and (iv)~the Received Signal Strength Indicator (RSSI) in dBm.
Half of the devices have registered data for at least $81\%$ of the possible timestamps.
From this data set, we utilized the $1,000$ first contacts out of $12,167$ valid unique contacts of the device with identifier `$449$'.
From this data set, we utilized the $1,000$ first contacts out of $12,167$ valid unique contacts of the device with identifier `$449$'. \kat{why only the 1000 first contacts? why device 449? why only one device and not multiple ones, and then report the mean?}
\paragraph{HUE}~\cite{makonin2018hue}
contains the hourly energy consumption data of $22$ residential customers of BCHydro, a provincial power utility, in British Columbia.
contains the hourly energy consumption data of $22$ residential customers of BCHydro, a provincial power utility in British Columbia.
The measurements for each residence are saved individually and each measurement contains (i)~the date (YYYY-MM-DD), (ii)~the hour, and (iii)~the energy consumption in kWh.
In our experiments, we used the first $1,000$ out of $29,231$ measurements of the residence with identifier `$1$', average energy consumption equal to $0.88$kWh, and value range $[0.28$, $4.45]$.
In our experiments, we used the first $1,000$ out of $29,231$ measurements of the residence with identifier `$1$', which has an average energy consumption of $0.88$kWh and a value range of $[0.28$, $4.45]$. \kat{Again, explain your choices. Moreover, you draw some conclusions later on, based on the characteristics of the data set, for example the density of the measurement values. You should describe all these characteristics in these paragraphs.}
\paragraph{T-drive}~\cite{yuan2010t}
consists of $15$ million GPS data points of the trajectories of $10,357$ taxis in Beijing, spanning a period of $1$ week and a total distance of $9$ million kilometers.
The taxis reported their location data on average every $177$ seconds and $623$ meters approximately.
Each vehicle registers (i)~the taxi unique identifier, (ii)~the timestamp (YYYY-MM-DD HH:MM:SS), (iii)~longitude, and (iv)~latitude.
These measurements are stored individually per vehicle.
We sampled the first $1000$ data items of the taxi with identifier `$2$'.
We sampled the first $1,000$ data items of the taxi with identifier `$2$'. \kat{again, explain your choices}
\subsubsection{Synthetic}
We generated synthetic time series of length equal to $100$ timestamps, for which we varied the number and distribution of {\thethings}.
We take into account only the temporal order of the points and the position of regular and {\thething} events within the series.
We generated synthetic time series of length equal to $100$ timestamps, for which we varied the number and distribution of {\thethings}.
In this way, we have a controlled data set that we can use to study the behaviour of our proposal.
\kat{More details needed, e.g., what are the distributions and the number of timestamps used? How many time series did you generate?}
We take into account only the temporal order of the points and the position of regular and {\thething} events within the series. \kat{Why is the value not important? For the energy consumption, the values mattered.}
\subsection{Configurations}
\label{subsec:eval-conf}
\kat{add some info here.. what are the configurations for? What does landmark percentage refer to, and how does it matter? }
\subsubsection{{\Thething} percentage}
For the Copenhagen data set, we achieve
In the Copenhagen data set, a landmark represents a timestamp when a contact device is registered.
We achieve
$0\%$ {\thethings} by considering an empty list of contact devices,
$20\%$ by extending the list with $[3$, $6$, $11$, $12$, $25$, $29$, $36$, $39$, $41$, $46$, $47$, $50$, $52$, $56$, $57$, $61$, $63$, $78$, $80]$,
$40\%$ with $[81$, $88$, $90$, $97$, $101$, $128$, $130$, $131$, $137$, $145$, $146$, $148$, $151$, $158$, $166$, $175$, $176]$,
$60\%$ with $[181$, $182$, $192$, $195$, $196$, $201$, $203$, $207$, $221$, $230$, $235$, $237$, $239$, $241$, $254]$,
$80\%$ with $[260$, $282$, $287$, $289$, $290$, $291$, $308$, $311$, $318$, $323$, $324$, $330$, $334$, $335$, $344$, $350$, $353$, $355$, $357$, $358$, $361$, $363]$, and
$100\%$ by including all of the possible contacts.
\kat{How did you decide which devices to add at each point?}
In HUE, we get $0$\%, $20$\% $40$\%, $60$\%, $80$\%, and $100$\% {\thethings} by setting the energy consumption threshold below $0.28$kWh, $1.12$kWh, $0.88$kWh, $0.68$kWh, $0.54$kWh, $4.45$kWh respectively.
\kat{Say what timestamps are landmarks in this data set. What is the consumption threshold?} In HUE, we get $0$\%, $20$\%, $40$\%, $60$\%, $80$\%, and $100$\% {\thethings} by setting the energy consumption threshold below $0.28$kWh, $1.12$kWh, $0.88$kWh, $0.68$kWh, $0.54$kWh, and $4.45$kWh, respectively.
In T-drive, we achieved the desired {\thething} percentages by utilizing the method of Li et al.~\cite{li2008mining} for detecting stay points in trajectory data.
In T-drive, a landmark represents the timestamp of a stay point. We achieved the desired {\thething} percentages by utilizing the method of Li et al.~\cite{li2008mining} for detecting stay points in trajectory data.
In more detail, the algorithm checks for each data item if each subsequent item is within a given distance threshold $\Delta l$ and measures the time period $\Delta t$ between the present point and the last subsequent point.
We achieve $0$\%, $20$\% $40$\%, $60$\%, $80$\%, and $100$\% {\thethings} by setting the ($\Delta l$ in meters, $\Delta t$ in minutes) pairs input to the stay point discovery method as [($0$, $1000$), ($2095$, $30$), ($2790$, $30$), ($3590$, $30$), ($4825$, $30$), ($10350$, $30$)].
We achieve $0$\%, $20$\%, $40$\%, $60$\%, $80$\%, and $100$\% {\thethings} by setting the ($\Delta l$ in meters, $\Delta t$ in minutes) pairs input to the stay point discovery method as [($0$, $1000$), ($2095$, $30$), ($2790$, $30$), ($3590$, $30$), ($4825$, $30$), ($10350$, $30$)]. \kat{how did you come up with these numbers?}
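The stay point check described above could be sketched roughly as follows. This is a simplified variant written for illustration, not the code used in the experiments: the function names, the `(t_seconds, lat, lon)` tuple layout, the haversine distance, and the choice to advance by one point when no stay point is found are all assumptions.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    r = 6_371_000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def stay_points(points, delta_l, delta_t_min):
    """Return inclusive index ranges (i, j) of stay points.

    points: list of (t_seconds, lat, lon) tuples sorted by time (assumed layout).
    delta_l: distance threshold in meters.
    delta_t_min: time threshold in minutes.
    """
    result = []
    i, n = 0, len(points)
    while i < n:
        j = i + 1
        # Extend the window while subsequent points stay within delta_l of point i.
        while j < n and haversine_m(points[i][1], points[i][2],
                                    points[j][1], points[j][2]) <= delta_l:
            j += 1
        # Points i..j-1 are within range; check how long the object stayed there.
        if j - 1 > i and points[j - 1][0] - points[i][0] >= delta_t_min * 60:
            result.append((i, j - 1))
            i = j
        else:
            i += 1
    return result

# Hypothetical trajectory: four nearby points over 40 minutes, then a distant one.
pts = [(0, 39.9, 116.4), (600, 39.9001, 116.4), (1200, 39.9, 116.4001),
       (2400, 39.9001, 116.4001), (2460, 40.0, 116.5)]
found = stay_points(pts, delta_l=50, delta_t_min=30)
```

Under this sketch, a larger $\Delta l$ merges more points into stay points, which is consistent with the increasing distance thresholds listed for higher {\thething} percentages.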
We generated synthetic data with \emph{skewed} (the {\thethings} are distributed towards the beginning/end of the series), \emph{symmetric} (in the middle), \emph{bimodal} (both end and beginning), and \emph{uniform} (all over the time series) {\thething} distributions.
In order to get {\thethings} with the above distribution features, we generate probability distributions with appropriate characteristics and sample from them, without replacement, the desired number of points.
%The generated distributions are representative of the cases that we wish to examine during the experiments.
For example, for a left-skewed {\thething} distribution we would utilize a truncated distribution resulting from the restriction of the domain of a distribution to the beginning and end of the time series with its location shifted to the center of the right half of the series.
For a left-skewed {\thething} distribution, we utilize a truncated distribution, obtained by restricting the domain of a distribution to the interval between the beginning and the end of the time series, with its location shifted to the center of the right half of the series.
For a right-skewed ....
For a symmetric ..
For a bimodal ..
For uniform ...
\kat{repeat for all kinds of distributions}
For consistency, we calculate the scale parameter depending on the length of the series by setting it equal to the series' length over a constant.
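As a rough illustration of the sampling procedure described above, the sketch below draws distinct {\thething} timestamps from a truncated, discretized density. The base distribution (a normal density), the constant `10` in the scale, and all function and parameter names are assumptions made for the example; the source does not specify them.

```python
import math
import random

def sample_landmarks(length, count, loc, scale, rng):
    """Sample `count` distinct timestamps from 1..length without replacement.

    Each timestamp t is weighted by an (unnormalized) normal density centered
    at `loc` with spread `scale`; restricting t to 1..length truncates it.
    """
    candidates = list(range(1, length + 1))
    weights = [math.exp(-0.5 * ((t - loc) / scale) ** 2) for t in candidates]
    chosen = []
    for _ in range(count):
        # Draw one index proportionally to the remaining weights, then remove it.
        pick = rng.choices(range(len(candidates)), weights=weights)[0]
        chosen.append(candidates.pop(pick))
        weights.pop(pick)
    return sorted(chosen)

length = 100
scale = length / 10           # series length over a constant (constant assumed)
loc = 0.75 * length           # center of the right half of the series
landmarks = sample_landmarks(length, 20, loc, scale, random.Random(0))
```

Changing `loc` (for instance to the center of the left half or of the whole series) would give the other placements mentioned above under the same assumptions.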
\kat{The following paragraph does not belong in this section..}
Notice that in our experiments, in the cases when we have $0\%$ and $100\%$ of the events being {\thethings}, we get the same behavior as in event- and user-level privacy respectively.
This happens due to the fact that at each timestamp we take into account only the data items at the current timestamp and ignore the rest of the time series (event-level) when there are no {\thethings}.
Whereas, when each timestamp corresponds to a {\thething} we consider and protect all the events throughout the entire series (user-level).
@ -72,17 +84,20 @@ Whereas, when each timestamp corresponds to a {\thething} we consider and protec
\subsubsection{Privacy parameters}
\kat{Explain why you select each of these perturbation mechanisms for each of the data sets. Is randomized response differentially private? Mention it!}
To perturb the contact tracing data of the Copenhagen data set, we utilize the \emph{randomized response} technique~\cite{wang2017locally} to report with probability $p = \frac{e^\varepsilon}{e^\varepsilon + 1}$ whether the current contact is a {\thething} or not.
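A minimal sketch of this reporting rule for a single binary value, assuming the true answer is kept with probability $p$ and flipped otherwise; the function name and the seeded generator are illustrative choices, not part of the original implementation.

```python
import math
import random

def randomized_response(is_landmark, epsilon, rng):
    """Report the true bit with probability p = e^eps / (e^eps + 1), else flip it."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return is_landmark if rng.random() < p else not is_landmark

rng = random.Random(0)
reports = [randomized_response(True, 1.0, rng) for _ in range(10_000)]
truth_rate = sum(reports) / len(reports)  # should be close to p (about 0.73 for eps = 1)
```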
We randomize the energy consumption in HUE with the Laplace mechanism (described in detail in Section~\ref{subsec:prv-mech}).
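For reference, a generic Laplace perturbation of a single value could look like the sketch below. It assumes the usual scale $\Delta f / \varepsilon$ and draws the noise as the difference of two exponential variates; the function name and default arguments are hypothetical, and this is not the authors' code.

```python
import random

def laplace_mechanism(value, epsilon=1.0, sensitivity=1.0, rng=None):
    """Return value plus Laplace noise with scale sensitivity / epsilon."""
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return value + noise

rng = random.Random(0)
noisy = [laplace_mechanism(0.88, rng=rng) for _ in range(20_000)]
mean_noise = sum(v - 0.88 for v in noisy) / len(noisy)
mean_abs_noise = sum(abs(v - 0.88) for v in noisy) / len(noisy)  # close to the scale
```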
We inject noise, sampled from the Planar Laplace mechanism~\cite{andres2013geo}, into the spatial values in T-drive.
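One way to draw planar Laplace noise is to sample a uniform angle and a radius from a Gamma distribution with shape $2$ and scale $1/\varepsilon$, which is the radial distribution of a planar density proportional to $e^{-\varepsilon r}$. The sketch below works in planar coordinates (e.g., meters); projecting the offset onto longitude and latitude is left out, and the function name is an assumption.

```python
import math
import random

def planar_laplace_noise(epsilon, rng):
    """Return a (dx, dy) offset drawn from the planar Laplace distribution.

    epsilon is expressed per unit of distance (assumed: per meter).
    """
    theta = rng.uniform(0.0, 2.0 * math.pi)      # uniform direction
    r = rng.gammavariate(2.0, 1.0 / epsilon)     # radius with density eps^2 * r * exp(-eps * r)
    return r * math.cos(theta), r * math.sin(theta)

rng = random.Random(0)
offsets = [planar_laplace_noise(0.01, rng) for _ in range(20_000)]
mean_radius = sum(math.hypot(dx, dy) for dx, dy in offsets) / len(offsets)  # about 2 / eps
```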
We set the privacy budget $\varepsilon = 1$, and, for simplicity, we assume that for every query sensitivity it holds that $\Delta f = 1$.
For the experiments performed on the synthetic data sets, the original values to be released do not influence the outcome of our conclusions, thus we ignore them.
We set the privacy budget $\varepsilon = 1$, and, for simplicity, we assume that for every query sensitivity it holds that $\Delta f = 1$. \kat{why don't you consider other values as well?}
For the experiments performed on the synthetic data sets, the original values to be released do not influence the outcome of our conclusions, thus we ignore them.
\kat{why are the values not important for the synthetic dataset? This seems a little weird, when said out of context.. our goal is to perturb the values, but do not really care about the way we perturb our values?}
% Finally, notice that, depending on the results' variation, most diagrams are in logarithmic scale.
\subsubsection{Temporal correlation}
\kat{Did you find any correlation in the other data? Do you need the correlation matrix to be known a priori? Describe a little why you did not use the real data for correlations }
We model the temporal correlation in the synthetic data as a \emph{stochastic matrix} $P$, using a \emph{Markov Chain}~\cite{gagniuc2017markov}.
$P$ is an $n \times n$ matrix, where the element $P_{ij}$
%at the $i$th row of the $j$th column that


@ -1,11 +1,11 @@
\chapter{Evaluation}
\label{ch:eval}
In this chapter we present the experiments that we performed, to evaluate the methodology that we introduced in Chapter~\ref{ch:lmdk-prv}, on real and synthetic data sets.
Section~\ref{sec:eval-dtl} contains all the details regarding the data sets the we utilized for our experiments (Section~\ref{subsec:eval-dat}) along with the parameter configurations.
In this chapter we present the experiments that we performed in order to evaluate {\thething} Privacy (Chapter~\ref{ch:lmdk-prv}) on real and synthetic data sets.
Section~\ref{sec:eval-dtl} contains all the details regarding the data sets that we used for our experiments along with the system configurations.
Section~\ref{sec:eval-lmdk} evaluates the data utility of the {\thething} privacy mechanisms that we designed in Section~\ref{sec:thething} and investigates the behavior of the privacy loss under temporal correlation for different distributions of {\thethings}.
Section~\ref{sec:eval-lmdk-sel} justifies our decisions while designing the privacy-preserving {\thething} selection component in Section~\ref{sec:theotherthing} and the data utility impact of the latter.
Finally, Section~\ref{sec:eval-sum} concludes this chapter by summarizing the main takeaways of the results of the experiments that we performed.
Finally, Section~\ref{sec:eval-sum} concludes this chapter by summarizing the main results derived from the experiments.
\input{evaluation/details}
\input{evaluation/thething}


@ -1,7 +1,7 @@
\section{Summary}
\label{sec:eval-sum}
In this chapter we presented the experimental evaluation of the {\thething} privacy mechanisms and privacy-preserving {\thething} selection mechanism, that we developed in Chapter~\ref{ch:lmdk-prv}, on real and synthetic data sets.
In this chapter we presented the experimental evaluation of the {\thething} privacy mechanisms and privacy-preserving {\thething} selection mechanism that we developed in Chapter~\ref{ch:lmdk-prv}, on real and synthetic data sets.
The Adaptive mechanism is the most reliable and best performing mechanism, in terms of overall data utility, with minimal tuning across most cases.
Skip performs optimally in data sets with a lower value range where approximation fits best.
The {\thething} selection component introduces a reasonable data utility decline to all of our mechanisms; however, the Adaptive mechanism handles it well and bounds the data utility to higher levels compared to user-level protection.