
\section{Details}
\label{sec:eval-dtl}

In this section, we list the details of the evaluation setting (Section~\ref{subsec:eval-setup}), the real and synthetic data sets that we used (Section~\ref{subsec:eval-dat}), and the corresponding configurations (Section~\ref{subsec:eval-conf}).


\subsection{Setting}
\label{subsec:eval-setup}

We implemented our experiments\footnote{Code available at \url{https://git.delkappa.com/manos/the-last-thing}} in Python $3$.$9$.$7$ and executed them on a machine with an Intel i$7$-$6700$HQ CPU at $3$.$5$GHz and $16$GB of RAM, running Manjaro Linux $21$.$1$.$5$.
We repeated each experiment $100$ times and we report the mean over these iterations.


\subsection{Data sets}
\label{subsec:eval-dat}

\subsubsection{Real}

\paragraph{Copenhagen}~\cite{sapiezynski2019interaction}
is a data set that was collected via the smartphone devices of $851$ university students over a period of $4$ weeks as part of the Copenhagen Networks Study.
Each device was configured to be discoverable by and to discover nearby Bluetooth devices every $5$ minutes.
Upon discovery, each device registers (i)~the timestamp in seconds, (ii)~the device's unique identifier, (iii)~the unique identifier of the device that it discovered ($- 1$ when no device was found or $- 2$ for any non-participating device), and (iv)~the Received Signal Strength Indicator (RSSI) in dBm.
Half of the devices have registered data for at least $81\%$ of the possible timestamps.
From this data set, we utilized the first $1{,}000$ contacts out of the $12{,}167$ valid unique contacts of the device with identifier `$449$'.

\paragraph{HUE}~\cite{makonin2018hue}
contains the hourly energy consumption data of $22$ residential customers of BCHydro, a provincial power utility, in British Columbia.
The measurements for each residence are saved individually and each measurement contains (i)~the date (YYYY-MM-DD), (ii)~the hour, and (iii)~the energy consumption in kWh.
In our experiments, we used the first $1{,}000$ out of the $29{,}231$ measurements of the residence with identifier `$1$', with an average energy consumption of $0.88$kWh and a value range of $[0.28$, $4.45]$.

\paragraph{T-drive}~\cite{yuan2010t}
consists of $15$ million GPS data points of the trajectories of $10{,}357$ taxis in Beijing, spanning a period of $1$ week and a total distance of $9$ million kilometers.
The taxis reported their location data on average every $177$ seconds and $623$ meters approximately.
Each vehicle registers (i)~the taxi unique identifier, (ii)~the timestamp (YYYY-MM-DD HH:MM:SS), (iii)~longitude, and (iv)~latitude.
These measurements are stored individually per vehicle.
We sampled the first $1{,}000$ data items of the taxi with identifier `$2$'.

\subsubsection{Synthetic}
We generated synthetic time series of length equal to $100$ timestamps, for which we varied the number and distribution of {\thethings}.
We take into account only the temporal order of the points and the position of regular and {\thething} events within the series. 


\subsection{Configurations}
\label{subsec:eval-conf}

\subsubsection{{\Thething} percentage}

For the Copenhagen data set, we achieve 
$0\%$ {\thethings} by considering an empty list of contact devices,
$20\%$ by extending the list with $[3$, $6$, $11$, $12$, $25$, $29$, $36$, $39$, $41$, $46$, $47$, $50$, $52$, $56$, $57$, $61$, $63$, $78$, $80]$, 
$40\%$ with $[81$, $88$, $90$, $97$, $101$, $128$, $130$, $131$, $137$, $145$, $146$, $148$, $151$, $158$, $166$, $175$, $176]$, 
$60\%$ with $[181$, $182$, $192$, $195$, $196$, $201$, $203$, $207$, $221$, $230$, $235$, $237$, $239$, $241$, $254]$, 
$80\%$ with $[260$, $282$, $287$, $289$, $290$, $291$, $308$, $311$, $318$, $323$, $324$, $330$, $334$, $335$, $344$, $350$, $353$, $355$, $357$, $358$, $361$, $363]$, and 
$100\%$ by including all of the possible contacts.

In HUE, we get $0$\%, $20$\%, $40$\%, $60$\%, $80$\%, and $100$\% {\thethings} by setting the energy consumption threshold below $0.28$kWh, $1.12$kWh, $0.88$kWh, $0.68$kWh, $0.54$kWh, and $4.45$kWh, respectively.

In T-drive, we achieved the desired {\thething} percentages by utilizing the method of Li et al.~\cite{li2008mining} for detecting stay points in trajectory data.
In more detail, the algorithm checks for each data item if each subsequent item is within a given distance threshold $\Delta l$ and measures the time period $\Delta t$ between the present point and the last subsequent point.
We achieve $0$\%, $20$\%, $40$\%, $60$\%, $80$\%, and $100$\% {\thethings} by setting the ($\Delta l$ in meters, $\Delta t$ in minutes) pairs that are input to the stay point discovery method to [($0$, $1000$), ($2095$, $30$), ($2790$, $30$), ($3590$, $30$), ($4825$, $30$), ($10350$, $30$)], respectively.
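
As a sketch of the stay point condition in symbols, let $p_k$ be the $k$th data item of a trajectory, $t_k$ its timestamp, and $d(\cdot, \cdot)$ the distance between two locations; this notation is introduced here only for illustration.
Under this reading, consecutive items $p_i, \dots, p_j$ form a stay point when
$$d(p_i, p_k) \leq \Delta l \quad \forall k \in \{i + 1, \dots, j\} \quad \mathrm{and} \quad t_j - t_i \geq \Delta t$$
Thus, for a fixed $\Delta t$ of $30$ minutes, a larger $\Delta l$ admits more data items into stay points, which is consistent with the increasing percentages listed above.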

We generated synthetic data with \emph{skewed} (the {\thethings} are distributed towards the beginning/end of the series), \emph{symmetric} (in the middle), \emph{bimodal} (both end and beginning), and \emph{uniform} (all over the time series) {\thething} distributions.
In order to get {\thethings} with the above distribution features, we generate probability distributions with appropriate characteristics and sample from them, without replacement, the desired number of points.
%The generated distributions are representative of the cases that we wish to examine during the experiments.
For example, for a left-skewed {\thething} distribution we would utilize a truncated distribution resulting from the restriction of the domain of a distribution to the beginning and end of the time series with its location shifted to the center of the right half of the series.
For consistency, we calculate the scale parameter depending on the length of the series by setting it equal to the series' length over a constant.
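
For illustration, let $f$ and $F$ denote the density and the cumulative distribution function of the chosen base distribution, $T$ the length of the series, and $c$ the constant; this notation is hypothetical, and neither the base distribution nor the value of $c$ is fixed here.
Restricting the domain to $[1, T]$ then yields the truncated density
$$f_{[1, T]}(t) = \frac{f(t)}{F(T) - F(1)}, \quad 1 \leq t \leq T$$
with scale $\frac{T}{c}$.
For $T = 100$, the center of the right half of the series lies at around $t = 75$.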

Notice that in our experiments, in the cases when we have $0\%$ and $100\%$ of the events being {\thethings}, we get the same behavior as in event- and user-level privacy respectively.
This happens due to the fact that, when there are no {\thethings}, at each timestamp we take into account only the data items at the current timestamp and ignore the rest of the time series (event-level).
In contrast, when each timestamp corresponds to a {\thething}, we consider and protect all the events throughout the entire series (user-level).


\subsubsection{Privacy parameters}

To perturb the contact tracing data of the Copenhagen data set, we utilize the \emph{randomized response} technique~\cite{wang2017locally} to report truthfully, with probability $p = \frac{e^\varepsilon}{e^\varepsilon + 1}$, whether the current contact is a {\thething} or not.
We randomize the energy consumption in HUE with the Laplace mechanism (described in detail in Section~\ref{subsec:prv-mech}).
We perturb the spatial values in T-drive by injecting noise that we sample from the Planar Laplace mechanism~\cite{andres2013geo}.

We set the privacy budget $\varepsilon = 1$, and, for simplicity, we assume that the sensitivity of every query is $\Delta f = 1$.
For the experiments performed on the synthetic data sets, the original values to be released do not influence our conclusions, thus we ignore them.
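
For illustration only, assume that the whole budget $\varepsilon = 1$ is spent on a single release; the budget that is actually available at each timestamp depends on the budget allocation and may be lower.
In that case, the true value is reported with probability
$$p = \frac{e^{1}}{e^{1} + 1} \approx 0.731$$
and the opposite value with probability $1 - p \approx 0.269$, while the Laplace mechanism draws noise with scale $\frac{\Delta f}{\varepsilon} = 1$.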
% Finally, notice that, depending on the results' variation, most diagrams are in logarithmic scale.


\subsubsection{Temporal correlation}

We model the temporal correlation in the synthetic data as a \emph{stochastic matrix} $P$, using a \emph{Markov Chain}~\cite{gagniuc2017markov}.
$P$ is an $n \times n$ matrix, where the element $P_{ij}$
%at the $i$th row of the $j$th column that 
represents the transition probability from a state $i$ to another state $j$.
%, $\forall i, j \leq n$.
It holds that the elements of every row $i$ of $P$ sum up to $1$.
We follow the \emph{Laplacian smoothing} technique~\cite{sorkine2004laplacian} as utilized in~\cite{cao2018quantifying} to generate the matrix $P$ with a degree of temporal correlation $s > 0$, where each element is equal to
% and generate a stochastic matrix $P$ with a degree of temporal correlation $s$ by calculating each element $P_{ij}$ as follows
$$P_{ij} = \frac{(I_{n})_{ij} + s}{\sum_{k = 1}^{n}((I_{n})_{ik} + s)}$$
where $I_{n}$ is an \emph{identity matrix} of size $n$.
%, i.e.,~an $n \times n$ matrix with $1$s on its main diagonal and $0$s elsewhere.
% $s$ takes only positive values which are comparable only for stochastic matrices of the same size.
The value of $s$ is comparable only for stochastic matrices of the same size and dictates the strength of the correlation; the lower its value, 
% the lower the degree of uniformity of each row, and therefore 
the stronger the correlation degree.
%In general, larger transition matrices tend to be uniform, resulting in weaker correlation.
In our experiments, for simplicity, we set $n = 2$ and we investigate the effect of \emph{weak} ($s = 1$), \emph{moderate} ($s = 0.1$), and \emph{strong} ($s = 0.01$) temporal correlation degree on the overall privacy loss.
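
As a worked illustration, substituting $n = 2$ into the expression above gives diagonal elements equal to $\frac{1 + s}{1 + 2s}$ and off-diagonal elements equal to $\frac{s}{1 + 2s}$.
The three settings therefore correspond, with values rounded to three decimals, to
$$P_{s = 1} \approx \left(\begin{array}{cc} 0.667 & 0.333 \\ 0.333 & 0.667 \end{array}\right), \quad
P_{s = 0.1} \approx \left(\begin{array}{cc} 0.917 & 0.083 \\ 0.083 & 0.917 \end{array}\right), \quad
P_{s = 0.01} \approx \left(\begin{array}{cc} 0.990 & 0.010 \\ 0.010 & 0.990 \end{array}\right)$$
As $s$ decreases, $P$ approaches the identity matrix, i.e.,~each state almost surely persists from one timestamp to the next.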