diff --git a/text/evaluation/details.tex b/text/evaluation/details.tex index e27fea9..e327d45 100644 --- a/text/evaluation/details.tex +++ b/text/evaluation/details.tex @@ -1,6 +1,6 @@ \section{Setting, configurations, and data sets} \label{sec:eval-dtl} -In this section we list all the relevant details regarding the evaluation setting (Section~\ref{subsec:eval-setup}), and we present the real and synthetic data sets that we used (Section~\ref{subsec:eval-dat}), along with the corresponding configurations (Section~\ref{subsec:eval-conf}). +In this section we list all the relevant details regarding the evaluation setting (Section~\ref{subsec:eval-setup}), and we present the real and synthetic data sets that we used (Section~\ref{subsec:eval-dat}) along with the corresponding configurations (Section~\ref{subsec:eval-conf}). \subsection{Machine setup} @@ -21,7 +21,7 @@ We performed experiments on real (Section~\ref{subsec:eval-dat-real}) and synthe For uniformity and in order to be consistent, we sample from each of the following data sets the first $1,000$ entries that satisfy the configuration criteria that we discuss in detail in Section~\ref{subsec:eval-conf}. \paragraph{Copenhagen}~\cite{sapiezynski2019interaction} -data set was collected via the smartphone devices of $851$ university students over a period of $4$ week as part of the Copenhagen Networks Study. +data set was collected via the smartphone devices of $851$ university students over a period of $4$ weeks as part of the Copenhagen Networks Study. Each device was configured to be discoverable by and to discover nearby Bluetooth devices every $5$ minutes. Upon discovery, each device registers (i)~the timestamp in seconds, (ii)~the device's unique identifier, (iii)~the unique identifier of the device that it discovered ($- 1$ when no device was found or $- 2$ for any non-participating device), and (iv)~the Received Signal Strength Indicator (RSSI) in dBm. 
Half of the devices have registered data at at least $81\%$ of the possible timestamps. @@ -41,14 +41,14 @@ In those entries, out of a total of $29,231$ measurements, we estimated an avera \paragraph{T-drive}~\cite{yuan2010t} consists of $15$ million GPS data points of the trajectories of $10,357$ taxis in Beijing, spanning a period of $1$ week and a total distance of $9$ million kilometers. The taxis reported their location data on average every $177$ seconds and $623$ meters approximately. -Each vehicle registers (i)~the taxi unique identifier, (ii)~the timestamp (YYYY-MM-DD HH:MM:SS), (iii)~longitude, and (iv)~latitude. +Each vehicle registers (i)~the taxi unique identifier, (ii)~the timestamp (YYYY-MM-DD HH:MM:SS), (iii)~the longitude, and (iv)~the latitude. These measurements are stored individually per vehicle. -We sampled the first $1000$ data items of the taxi with identifier `$2$', which satisfied our configuration criteria (Section~\ref{subsec:eval-conf}). +We sampled the first $1,000$ data items of the taxi with identifier `$2$', which satisfied our configuration criteria (Section~\ref{subsec:eval-conf}). % \kat{again, explain your choices} % \mk{OK} -\subsubsection{Synthetic} +\subsubsection{Synthetic data sets} \label{subsec:eval-dat-syn} We generated synthetic time series of length equal to $100$ timestamps, for which we varied the number and distribution of {\thethings}. In this way, we have a controlled data set that we can use to study the behavior of our proposal. @@ -83,7 +83,7 @@ $100\%$ by including all of the possible contacts. % \kat{Say what time-stamps are landmarks in this data set. What is the consumption threshld?} % \mk{OK} In HUE, we consider as {\thethings} the events that have energy consumption values below a certain threshold. 
-That is, we get $0$\%, $20$\% $40$\%, $60$\%, $80$\%, and $100$\% {\thethings} by setting the energy consumption threshold at $0.28$kWh, $1.12$kWh, $0.88$kWh, $0.68$kWh, $0.54$kWh, and $4.45$kWh respectively. +That is, we get $0$\%, $20$\%, $40$\%, $60$\%, $80$\%, and $100$\% {\thethings} by setting the energy consumption threshold to $0.28$kWh, $1.12$kWh, $0.88$kWh, $0.68$kWh, $0.54$kWh, and $4.45$kWh, respectively. In T-drive, a {\thething} represents a location where a vehicle spend some time. We achieved the desired {\thething} percentages by utilizing the method of Li et al.~\cite{li2008mining} for detecting stay points in trajectory data. @@ -93,7 +93,7 @@ After analyzing the data and experimenting with different pairs of distance and We generated synthetic data with \emph{skewed} (the {\thethings} are distributed towards the beginning/end of the series), \emph{symmetric} (in the middle), \emph{bimodal} (both end and beginning), and \emph{uniform} (all over the time series) {\thething} distributions. In order to get {\thething} sets with the above distribution features, we generate probability distributions with restricted domain to the beginning and end of the time series, and sample from them, without replacement, the desired number of points. -For each case, we place the location, i.e.,~centre, of the distribution accordingly. +For each case, we place the location (centre) of the distribution accordingly. That is, for symmetric we put the location in the middle of the time series and for left/right skewed to the right/left. For bimodal we combine two mirrored skewed distributions. Finally, for the uniform distribution we distribute the {\thethings} randomly throughout the time series. @@ -103,14 +103,14 @@ For consistency, we calculate the scale parameter of the corresponding distribut \subsubsection{Privacy parameters} \label{subsec:eval-conf-prv} % \kat{Explain why you select each of these perturbation mechanisms for each of the datasets. 
Is the random response differential private? Mention it! } -For all of te real data sets, we implement $\varepsilon$-differential privacy by selecting a mechanism, from those that we described in Section~\ref{subsec:prv-mech}, that is best suited for the type of its sensitive attributes. -To perturb the contact tracing data of the Copenhagen data set, we utilize the \emph{random response} technique~\cite{wang2017locally}, and at each timestamp we report truthfully, with probability $p = \frac{e^\varepsilon}{e^\varepsilon + 1}$, whether the current contact is a {\thething} or not. +For each of the real data sets, we implement $\varepsilon$-differential privacy by selecting a mechanism, from those that we described in Section~\ref{subsec:prv-mech}, that is best suited for the type of its sensitive attributes. +To perturb the contact tracing data of the Copenhagen data set, we utilize the \emph{random response} technique~\cite{wang2017locally}, and we report truthfully at each timestamp, with probability $p = \frac{e^\varepsilon}{e^\varepsilon + 1}$, whether the current contact is a {\thething} or not. We randomize the energy consumption in HUE with the Laplace mechanism~\cite{dwork2014algorithmic}. -For T-drive, we perturb the location data with noise that we sample from the Planar Laplace mechanism~\cite{andres2013geo}. +For T-drive, we perturb the location data with noise that we sample from a Planar Laplace distribution~\cite{andres2013geo}. We set the privacy budget $\varepsilon = 1$ for all of our experiments and, for simplicity, we assume that for every query sensitivity it holds that $\Delta f = 1$. % \kat{why don't you consider other values as well?} -For the experiments that we performed on the synthetic data sets, the original values to be released are not relevant to what we we to observe, and thus we ignore them. 
+For the experiments that we performed on the synthetic data sets, the original values to be released are not relevant to what we want to observe, and thus we ignore them. % \kat{why are the values not important for the synthetic dataset? This seems a little weird, when said out of context.. our goal is to perturb the values, but do not really care about the way we perturb our values?} % Finally, notice that, depending on the results' variation, most diagrams are in logarithmic scale.