
\section{Setting, configurations, and data sets}
\label{sec:eval-dtl}
In this section we list all the relevant details regarding the evaluation setting (Section~\ref{subsec:eval-setup}), and we present the real and synthetic data sets that we used (Section~\ref{subsec:eval-dat}), along with the corresponding configurations (Section~\ref{subsec:eval-conf}).


\subsection{Machine setup}
\label{subsec:eval-setup}
We implemented our experiments\footnote{Source code available at \url{https://git.delkappa.com/manos/the-last-thing}} in Python $3$.$9$.$7$ and executed them on a machine with an Intel i$7$-$6700$HQ CPU at $3$.$5$GHz and $16$GB of RAM, running Manjaro Linux $21$.$1$.$5$.
We repeated each experiment $100$ times and we report the mean over these iterations.
% \kat{It could be interesting to report also on the diagrams the std}
% \mk{I'll keep it in mind.}


\subsection{Data sets}
\label{subsec:eval-dat}
We performed experiments on real (Section~\ref{subsec:eval-dat-real}) and synthetic data sets (Section~\ref{subsec:eval-dat-syn}).


\subsubsection{Real data sets}
\label{subsec:eval-dat-real}
For consistency, we sample from each of the following data sets the first $1,000$ entries that satisfy the configuration criteria that we discuss in detail in Section~\ref{subsec:eval-conf}.

\paragraph{Copenhagen}~\cite{sapiezynski2019interaction}
data set was collected via the smartphone devices of $851$ university students over a period of $4$ weeks as part of the Copenhagen Networks Study.
Each device was configured to be discoverable by and to discover nearby Bluetooth devices every $5$ minutes.
Upon discovery, each device registers (i)~the timestamp in seconds, (ii)~the device's unique identifier, (iii)~the unique identifier of the device that it discovered ($- 1$ when no device was found or $- 2$ for any non-participating device), and (iv)~the Received Signal Strength Indicator (RSSI) in dBm.
Half of the devices registered data for at least $81\%$ of the possible timestamps.
Three devices ($449$, $550$, $689$) satisfy our configuration criteria (Section~\ref{subsec:eval-conf}) within their first $1,000$ entries.
From those $3$ devices, we picked the first one, i.e.,~the device with identifier `$449$', and utilized its first $1,000$ entries out of $12,167$ unique valid contacts.
% \kat{why only the 1000 first contacts? why device 449? why only one device and not multiple ones, and then report the mean?}
% \mk{I explained why 449 and I added a general explanation in the intro of the subsection.}

\paragraph{HUE}~\cite{makonin2018hue}
contains the hourly energy consumption data of $22$ residential customers of BCHydro, a provincial power utility in British Columbia.
The measurements for each residence are saved individually and each measurement contains (i)~the date (YYYY-MM-DD), (ii)~the hour, and (iii)~the energy consumption in kWh.
In our experiments, we used the first residence, i.e.,~residence with identifier `$1$', that satisfies our configuration criteria (Section~\ref{subsec:eval-conf}) within its first $1,000$ entries. 
In those entries, out of a total of $29,231$ measurements, we estimated an average energy consumption equal to $0.88$kWh and a value range within $[0.28$, $4.45]$.
% \kat{again, explain your choices. Moreover, you make some conclusions later on, based on the characteristics of the data set, for example the density of the measurement values. You should describe all these characteristics in these paragraphs.}
% \mk{OK}

\paragraph{T-drive}~\cite{yuan2010t}
consists of $15$ million GPS data points of the trajectories of $10,357$ taxis in Beijing, spanning a period of $1$ week and a total distance of $9$ million kilometers.
The taxis reported their location data on average every $177$ seconds and $623$ meters approximately.
Each vehicle registers (i)~the taxi unique identifier, (ii)~the timestamp (YYYY-MM-DD HH:MM:SS), (iii)~longitude, and (iv)~latitude.
These measurements are stored individually per vehicle.
We sampled the first $1,000$ data items of the taxi with identifier `$2$', which satisfied our configuration criteria (Section~\ref{subsec:eval-conf}).
% \kat{again, explain your choices}
% \mk{OK}


\subsubsection{Synthetic data sets}
\label{subsec:eval-dat-syn}
We generated synthetic time series of length equal to $100$ timestamps, for which we varied the number and distribution of {\thethings}. 
In this way, we have a controlled data set that we can use to study the behavior of our proposal.
% \kat{more details needed. eg. what is the distributions and number of timestamps used? How many time series you generated? }
We take into account only the temporal order of the points and the position of regular and {\thething} events within the time series.
In Section~\ref{subsec:eval-conf}, we explain in more detail our configuration criteria.
% \kat{why is the value not important? at the energy consumption, they mattered}


\subsection{Configurations}
\label{subsec:eval-conf}
% \kat{add some info here.. what are the configurations for? What does landmark percentage refer to, and how does it matter? }
We vary the {\thething} percentage (Section~\ref{subsec:eval-conf-lmdk}), i.e.,~the proportion of timestamps that we attribute to {\thethings} rather than to regular events, in order to explore the behavior of our methodology in all possible scenarios.
For each data set, we implement a privacy mechanism that injects noise related to the type of its attribute values and we tune the parameters of each mechanism accordingly (Section~\ref{subsec:eval-conf-prv}).
Last but not least, we explain how we generate synthetic data sets with various degrees of temporal correlation so as to observe the impact on the overall privacy loss (Section~\ref{subsec:eval-conf-cor}).


\subsubsection{{\Thething} percentage}
\label{subsec:eval-conf-lmdk}
In the Copenhagen data set, a {\thething} represents a timestamp when a specific contact device is registered.
After identifying the unique contacts within the sample, we achieve each desired {\thethings} to regular events ratio by considering a list that contains a part of these contact devices.
In more detail, we achieve 
$0\%$ {\thethings} by considering an empty list of contact devices,
$20\%$ by extending the list with $[3$, $6$, $11$, $12$, $25$, $29$, $36$, $39$, $41$, $46$, $47$, $50$, $52$, $56$, $57$, $61$, $63$, $78$, $80]$, 
$40\%$ with $[81$, $88$, $90$, $97$, $101$, $128$, $130$, $131$, $137$, $145$, $146$, $148$, $151$, $158$, $166$, $175$, $176]$, 
$60\%$ with $[181$, $182$, $192$, $195$, $196$, $201$, $203$, $207$, $221$, $230$, $235$, $237$, $239$, $241$, $254]$, 
$80\%$ with $[260$, $282$, $287$, $289$, $290$, $291$, $308$, $311$, $318$, $323$, $324$, $330$, $334$, $335$, $344$, $350$, $353$, $355$, $357$, $358$, $361$, $363]$, and 
$100\%$ by including all of the possible contacts.
% \kat{How did you decide which devices to add at each point?}
% \mk{I discussed it earlier.}

% \kat{Say what time-stamps are landmarks in this data set. What is the consumption threshld?}
% \mk{OK}
In HUE, we consider as {\thethings} the events that have energy consumption values below a certain threshold.
That is, we get $0$\%, $20$\%, $40$\%, $60$\%, $80$\%, and $100$\% {\thethings} by setting the energy consumption threshold at $0.28$kWh, $1.12$kWh, $0.88$kWh, $0.68$kWh, $0.54$kWh, and $4.45$kWh respectively.

In T-drive, a {\thething} represents a location where a vehicle spends some time.
We achieved the desired {\thething} percentages by utilizing the method of Li et al.~\cite{li2008mining} for detecting stay points in trajectory data.
In more detail, the algorithm checks for each data item if each subsequent item is within a given distance threshold $\Delta l$ and measures the time period $\Delta t$ between the present point and the last subsequent point.
After analyzing the data and experimenting with different pairs of distance and time period, we achieve $0$\%, $20$\%, $40$\%, $60$\%, $80$\%, and $100$\% {\thethings} by setting the ($\Delta l$ in meters, $\Delta t$ in minutes) pairs that we pass to the stay point discovery method to [($0$, $1000$), ($2095$, $30$), ($2790$, $30$), ($3590$, $30$), ($4825$, $30$), ($10350$, $30$)].
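
Stated more formally, and under our reading of the stay point definition of Li et al.~\cite{li2008mining}, a run of consecutive data items $p_i, \dots, p_j$ with timestamps $t_i, \dots, t_j$ qualifies as a stay point when
\[
  d(p_i, p_k) \leq \Delta l \;\; \forall k \in \{i + 1, \dots, j\}, \qquad
  d(p_i, p_{j + 1}) > \Delta l, \qquad
  t_j - t_i \geq \Delta t,
\]
where the symbols $p_i$, $t_i$, and $d(\cdot, \cdot)$ (the distance between two locations) are notation that we assume here only for illustration.
Under this condition, a larger $\Delta l$ for a fixed $\Delta t$ tends to let longer runs of items qualify, which is in line with the growing {\thething} percentages of the pairs listed above.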
% \kat{how did you come up with these numbers?}

We generated synthetic data with \emph{skewed} (the {\thethings} are distributed towards the beginning/end of the series), \emph{symmetric} (in the middle), \emph{bimodal} (both end and beginning), and \emph{uniform} (all over the time series) {\thething} distributions.
In order to get {\thething} sets with the above distribution features, we generate probability distributions whose domain is restricted to the span between the beginning and the end of the time series, and sample from them, without replacement, the desired number of points.
For each case, we place the location, i.e.,~centre, of the distribution accordingly.
That is, for symmetric we put the location in the middle of the time series, and for left/right skewed towards the right/left end of the series, respectively.
For bimodal we combine two mirrored skewed distributions.
Finally, for the uniform distribution we distribute the {\thethings} randomly throughout the time series.
For consistency, we calculate the scale parameter of the corresponding distribution depending on the length of the time series by setting it equal to the series' length over a constant.


\subsubsection{Privacy parameters}
\label{subsec:eval-conf-prv}
% \kat{Explain why you select each of these perturbation mechanisms for each of the datasets. Is the random response differential private? Mention it! }
For all of the real data sets, we implement $\varepsilon$-differential privacy by selecting a mechanism, from those that we described in Section~\ref{subsec:prv-mech}, that is best suited for the type of their sensitive attributes.
To perturb the contact tracing data of the Copenhagen data set, we utilize the \emph{random response} technique~\cite{wang2017locally}, and at each timestamp we report truthfully, with probability $p = \frac{e^\varepsilon}{e^\varepsilon + 1}$, whether the current contact is a {\thething} or not.
We randomize the energy consumption in HUE with the Laplace mechanism~\cite{dwork2014algorithmic}.
For T-drive, we perturb the location data with noise that we sample from the Planar Laplace mechanism~\cite{andres2013geo}.
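
As a brief check that the random response technique above satisfies $\varepsilon$-differential privacy, and assuming its standard binary form in which the true answer is flipped with probability $1 - p$, the ratio of the probabilities of observing the same report under the two possible true answers is at most
\[
  \frac{p}{1 - p} = \frac{e^\varepsilon / (e^\varepsilon + 1)}{1 / (e^\varepsilon + 1)} = e^\varepsilon,
\]
which is exactly the bound that the definition of $\varepsilon$-differential privacy requires.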

We set the privacy budget $\varepsilon = 1$ for all of our experiments and, for simplicity, we assume that every query has sensitivity $\Delta f = 1$.
% \kat{why don't you consider other values as well?}
For the experiments that we performed on the synthetic data sets, the original values to be released are not relevant to what we want to observe, and thus we ignore them.
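
To make the setting for the real data sets concrete, with $\varepsilon = 1$ and $\Delta f = 1$, and assuming the standard formulations of the three mechanisms, the parameters reduce to
\[
  p = \frac{e}{e + 1} \approx 0.731, \qquad
  b = \frac{\Delta f}{\varepsilon} = 1, \qquad
  \mathrm{E}[r] = \frac{2}{\varepsilon} = 2,
\]
where $p$ is the probability of a truthful report in random response, $b$ is the scale of the Laplace noise, and $\mathrm{E}[r]$ is the expected distance between the original and the perturbed location under the Planar Laplace mechanism, expressed in the distance unit with respect to which $\varepsilon$ is defined.
The symbols $b$ and $r$ are introduced here only for illustration.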
% \kat{why are the values not important for the synthetic dataset? This seems a little weird, when said out of context.. our goal is to perturb the values, but do not really care about the way we perturb our values?}
% Finally, notice that, depending on the results' variation, most diagrams are in logarithmic scale.


\subsubsection{Temporal correlation}
\label{subsec:eval-conf-cor}
% \kat{Did you find any correlation in the other data? Do you need the correlation matrix to be known a priori? Describe a little why you did not use the real data for correlations }
Despite the inherent presence of temporal correlation in time series, it is challenging to correctly discover and quantify it.
For this reason, and in order to create a more controlled environment for our experiments, we chose to conduct tests relevant to temporal correlation using synthetic data sets.
We model the temporal correlation in the synthetic data as a \emph{stochastic matrix} $P$, using a \emph{Markov Chain}~\cite{gagniuc2017markov}.
$P$ is an $n \times n$ matrix, where the element $P_{ij}$
%at the $i$th row of the $j$th column that 
represents the transition probability from a state $i$ to another state $j$, $\forall$ $i$, $j$ $\leq$ $n$.
It holds that the elements of every row $i$ of $P$ sum up to $1$.
We follow the \emph{Laplacian smoothing} technique~\cite{sorkine2004laplacian}, as utilized in~\cite{cao2018quantifying}, to generate the matrix $P$ with a degree of temporal correlation $s > 0$, where each element $P_{ij}$ is equal to
% and generate a stochastic matrix $P$ with a degree of temporal correlation $s$ by calculating each element $P_{ij}$ as follows
\[ P_{ij} = \frac{(I_{n})_{ij} + s}{\sum_{k = 1}^{n}((I_{n})_{ik} + s)} \]
where $I_{n}$ is an \emph{identity matrix} of size $n$.
%, i.e.,~an $n \times n$ matrix with $1$s on its main diagonal and $0$s elsewhere.
% $s$ takes only positive values which are comparable only for stochastic matrices of the same size.
The value of $s$ is comparable only for stochastic matrices of the same size and dictates the strength of the correlation; the lower its value, 
% the lower the degree of uniformity of each row, and therefore 
the stronger the correlation degree.
%In general, larger transition matrices tend to be uniform, resulting in weaker correlation.
In our experiments, for simplicity, we set $n = 2$ and we investigate the effect of \emph{weak} ($s = 1$), \emph{moderate} ($s = 0.1$), and \emph{strong} ($s = 0.01$) temporal correlation degree on the overall privacy loss.
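
As a worked illustration of the formula above for $n = 2$ (values rounded to three decimal places), every diagonal element of $P$ equals $\frac{1 + s}{1 + 2s}$ and every off-diagonal element equals $\frac{s}{1 + 2s}$, which gives
\[
  P_{s = 1} \approx \left(\begin{array}{cc} 0.667 & 0.333 \\ 0.333 & 0.667 \end{array}\right), \quad
  P_{s = 0.1} \approx \left(\begin{array}{cc} 0.917 & 0.083 \\ 0.083 & 0.917 \end{array}\right), \quad
  P_{s = 0.01} \approx \left(\begin{array}{cc} 0.990 & 0.010 \\ 0.010 & 0.990 \end{array}\right).
\]
The smaller the value of $s$, the closer $P$ gets to the identity matrix, i.e.,~the more likely the state at one timestamp is to persist at the next one.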