Merge branch 'master'

This commit is contained in:
Manos Katsomallos 2021-10-15 14:41:09 +02:00
commit 20010b0167
30 changed files with 384 additions and 234 deletions

View File

@ -94,6 +94,7 @@ def main(args):
print(' [OK]', flush=True)
def parse_args():
    '''
    Parse arguments.
    Optional:
        iter - The number of repetitions.
        time - The time limit of the sequence.
    '''
    # Create argument parser.
    parser = argparse.ArgumentParser()
    # Mandatory arguments.
    # Optional arguments.
    parser.add_argument('-i', '--iter', help='The number of repetitions.', type=int, default=1)
    parser.add_argument('-t', '--time', help='The time limit of the sequence.', type=int, default=100)
    # Parse arguments.
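# Hypothetical usage sketch (illustration only, not part of the diff above):
# it assumes that parse_args() ends with `return parser.parse_args()` and
# that main() consumes the parsed arguments.
if __name__ == '__main__':
    arguments = parse_args()
    main(arguments)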

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

View File

@ -1,27 +1,31 @@
\chapter{Abstract}
\label{ch:abs}
\kat{Il faut aussi en francais :) }
% \kat{Il faut aussi en francais :) }
% \mk{D'accord :( }
Sensors, portable devices, and location-based services generate massive amounts of geo-tagged and/or location- and user-related data on a daily basis.
The manipulation of such data is useful in numerous application domains, e.g.,~healthcare, intelligent buildings, and traffic monitoring.
A high percentage of these data carry information of users' activities and other personal details, and thus their manipulation and sharing arise concerns about the privacy of the individuals involved.
To enable the secure---from the users' privacy perspective---data sharing, researchers have already proposed various seminal techniques for the protection of users' privacy.
A high percentage of these data carry information about user activities and other personal details, and thus their manipulation and sharing raise concerns about the privacy of the individuals involved.
To enable the secure---from the user privacy perspective---data sharing, researchers have already proposed various seminal techniques for the protection of user privacy.
However, the continuous fashion in which data are generated nowadays, and the high availability of external sources of information, pose more threats and add extra challenges to the problem.
\kat{Mention here the extra challenges posed by the specific problem that you address : the Landmark privacy}
% \kat{Mention here the extra challenges posed by the specific problem that you address : the Landmark privacy}
It is therefore essential to design solutions that not only guarantee privacy protection but also provide configurability and account for the preferences of the users.
% Survey
In this thesis, we visit the works done on data privacy for continuous data publishing, and report on the proposed solutions, with a special focus on solutions concerning location or geo-referenced data.
As a matter of fact, a wealth of algorithms have been proposed for privacy-preserving data publishing, either for microdata or statistical data.
In this thesis, we investigate the literature regarding data privacy in continuous data publishing, and report on the proposed solutions, with a special focus on solutions concerning location or geo-referenced data.
As a matter of fact, a wealth of algorithms has been proposed for privacy-preserving data publishing, either for microdata or statistical data.
In this context, we seek to offer a guide that would allow readers to choose the proper algorithm(s) for their specific use case accordingly.
We provide an insight into time-related properties of the algorithms, e.g.,~if they work on infinite, real-time data, or if they take into consideration existing data dependencies.
We provide an insight into time-related properties of the algorithms, e.g.,~if they work on infinite, real-time data, or if they take into consideration existing data dependence.
% Landmarks
Having discussed the literature around continuous data publication, we continue to propose a novel type of data privacy, called \emph{\thething} privacy.
Having discussed the literature around continuous data publishing, we proceed to propose a novel type of data privacy, called \emph{{\thething} privacy}.
We argue that in continuous data publishing, events are not equally significant in terms of privacy, and hence they should affect the privacy-preserving processing differently.
Differential privacy is a well-established paradigm in privacy-preserving time series publishing.
Different schemes exist, protecting either a single timestamp, or all the data per user or per window in the time series, considering however all timestamps as equally significant.
The novel scheme that we propose, \emph{\thething} privacy,is based on differential privacy, but also takes into account significant events (\emph{\thethings}) in the time series and allocates the available privacy budget accordingly.
We design three privacy models that guarantee {\thething} privacy and validate our proposal on real and synthetic data sets. \kat{add selection, and a small comment on the conclusions driven by the experiments.}
The novel scheme that we propose, {\thething} privacy, is based on differential privacy, but also takes into account significant events (\emph{\thethings}) in the time series and allocates the available privacy budget accordingly.
We design three privacy schemes that guarantee {\thething} privacy and further extend them in order to provide more robust privacy protection to the {\thething} set.
We evaluate our proposal on real and synthetic data sets and assess the impact on data utility with emphasis on situations under the presence of temporal correlation.
% \kat{add selection, and a small comment on the conclusions driven by the experiments.}
The results of the experimental evaluation and comparative analysis of {\thething} privacy validate its applicability to several use case scenarios with and without the presence of temporal correlation.
\paragraph{Keywords:}

View File

@ -1,14 +1,16 @@
\chapter{Acknowledgements}
\label{ch:ack}
\mk{WIP}
Upon the completion of my thesis, I would like to express my deep gratitude to my research supervisors for their patient guidance, enthusiastic encouragement and useful critiques of this research work.
I would also like to thank the reporters for their feedback, comments, and time.
Upon the completion of my thesis, I would like to express my deep gratitude to Prof. Dimitris Kotzinos for believing in me and for providing me with opportunities that helped me pave my path in academia.
% \kat{the jury and the reporters do not contribute; thank them for their feedback, comments and time}
This thesis would not have been possible without the patient guidance of Prof. Katerina Tzompanaki.
Her love for learning and hard work were inspiring and served as the catalyst for every single step that I made towards getting a grasp on computer science.
A special thanks to my departments faculty, staff and fellow researchers for their valuable assistance whenever needed and for creating a pleasant and creative environment during my studies.
I am genuinely grateful to the reporters for their time, effort, and valuable feedback.
Last but not least, I wish to thank my family and friends for their unconditional support and encouragement all these years.
A special thanks goes to Alexandros Kontarinis for being an exemplary colleague and a unique companion during this journey.
I would also like to thank the department's faculty, the lab's staff and researchers for creating a pleasant and creative environment.
Last but not least, I wish to express my thankfulness to my family and friends for their unconditional support and encouragement all these years.
\bigskip
\noindent

View File

@ -17,7 +17,7 @@
@online{acxiom,
title = {Acxiom},
url = {https://acxiom.com},
year = {Accessed on November 11, 2020}
year = {Accessed on October 11, 2021}
}
@book{adler2010geometry,
@ -268,7 +268,7 @@
year = {2016},
publisher = {Chambers and Partners},
howpublished = {\url{https://chambersandpartners.com/article/713/personal-data-the-new-oil-of-the-digital-economy}},
note = {Accessed on December 1, 2019}
note = {Accessed on October 11, 2021}
}
@article{chan2011private,
@ -299,7 +299,7 @@
publisher = {Channel 4},
year = {2018},
howpublished = {\url{https://channel4.com/news/data-democracy-and-dirty-tricks-cambridge-analytica-uncovered-investigation-expose}},
note = {Accessed on December 1, 2019}
note = {Accessed on October 11, 2021}
}
@inproceedings{chatzikokolakis2015geo,
@ -548,7 +548,7 @@
publisher = {The Economist},
year = {2016},
howpublished = {\url{https://economist.com/leaders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data}},
note = {Accessed on December 1, 2019}
note = {Accessed on October 11, 2021}
}
@inproceedings{efthymiou2015big,
@ -598,13 +598,13 @@
@online{experian,
title = {Experian},
url = {https://experian.com},
year = {Accessed on November 11, 2020}
year = {Accessed on October 11, 2021}
}
@online{facebook,
title = {Facebook},
url = {https://facebook.com},
year = {Accessed on November 11, 2020}
year = {Accessed on October 11, 2021}
}
@inproceedings{fan2013differentially,
@ -646,7 +646,7 @@
@online{foursquare,
title = {Foursquare},
url = {https://foursquare.com},
year = {Accessed on November 11, 2020}
year = {Accessed on October 11, 2021}
}
@misc{franceschi-bicchierairussell2015redditor,
@ -655,7 +655,7 @@
year = {2015},
publisher = {Mashable},
howpublished = {\url{https://mashable.com/2015/01/28/redditor-muslim-cab-drivers}},
note = {Accessed on July 1, 2020}
note = {Accessed on October 11, 2021}
}
@book{fuller2009introduction,
@ -751,7 +751,7 @@
@online{gmaps,
title = {Google Maps},
url = {https://google.com/maps},
year = {Accessed on November 11, 2020}
year = {Accessed on October 11, 2021}
}
@article{goldreich1998secure,
@ -1345,7 +1345,7 @@
@online{osm,
title = {OpenStreetMap},
url = {https://openstreetmap.org},
year = {Accessed on November 11, 2020}
year = {Accessed on October 11, 2021}
}
@article{ou2018optimal,
@ -1456,7 +1456,7 @@
@online{ring,
title = {Ring},
url = {https://ring.com},
year = {Accessed on November 11, 2020}
year = {Accessed on October 11, 2021}
}
@book{rogers2000diffusions,
@ -1473,7 +1473,7 @@
year = {2018},
publisher = {TechCrunch},
howpublished = {\url{https://techcrunch.com/2018/01/28/strava-exposes-military-bases}},
note = {Accessed on July 1, 2020}
note = {Accessed on October 11, 2021}
}
@inproceedings{samarati1998protecting,
@ -1667,19 +1667,19 @@
@online{tousanticovid,
title = {TousAntiCovid},
url = {https://bonjour.tousanticovid.gouv.fr},
year = {Accessed on November 11, 2020}
year = {Accessed on October 11, 2021}
}
@online{transunion,
title = {TransUnion},
url = {https://transunion.com},
year = {Accessed on November 11, 2020}
year = {Accessed on October 11, 2021}
}
@online{twitter,
title = {Twitter},
url = {https://twitter.com},
year = {Accessed on November 11, 2020}
year = {Accessed on October 11, 2021}
}
@inproceedings{varghese2016challenges,
@ -1813,7 +1813,7 @@
@online{waze,
title = {Waze},
url = {https://waze.com},
year = {Accessed on November 11, 2020}
year = {Accessed on October 11, 2021}
}
@incollection{wei2006time,
@ -1844,7 +1844,7 @@
@online{wiki,
title = {Wikipedia},
url = {https://wikipedia.com},
year = {Accessed on November 11, 2020}
year = {Accessed on October 11, 2021}
}
@misc{wired2014data,
@ -1853,7 +1853,7 @@
year = {2014},
publisher = {Wired},
howpublished = {\url{https://wired.com/insights/2014/07/data-new-oil-digital-economy}},
note = {Accessed on December 1, 2019}
note = {Accessed on October 11, 2021}
}
@misc{wolfgang1999stochastic,
@ -1904,7 +1904,7 @@
@online{xssfopes2020tweet,
title = {{@xssfopes: ``can anyone spot the issue with the algo? red is original data point, 400 ``anonymized'' data points calculated''}},
url = {https://twitter.com/xssfox/status/1251116087116042241},
year = {Accessed on July 16, 2020}
year = {Accessed on October 11, 2021}
}
@inproceedings{yang2015bayesian,

View File

@ -1,110 +1,131 @@
\section{Experimental Setting and Data Sets}
\section{Setting, configurations, and data sets}
\label{sec:eval-dtl}
In this section we list all the relevant details regarding the evaluation setting (Section~\ref{subsec:eval-setup}), and we present the real and synthetic data sets that we used (Section~\ref{subsec:eval-dat}), along with the corresponding configurations (Section~\ref{subsec:eval-conf}).
\subsection{Machine Setup}
\subsection{Machine setup}
\label{subsec:eval-setup}
We implemented our experiments\footnote{Code available at \url{https://git.delkappa.com/manos/the-last-thing}} in Python $3$.$9$.$7$ and executed them on a machine with an Intel i$7$-$6700$HQ at $3$.$5$GHz CPU and $16$GB RAM, running Manjaro Linux $21$.$1$.$5$.
We repeated each experiment $100$ times and we report the mean over these iterations. \kat{It could be interesting to report also on the diagrams the std}
We implemented our experiments\footnote{Source code available at \url{https://git.delkappa.com/manos/the-last-thing}} in Python $3$.$9$.$7$ and executed them on a machine with an Intel i$7$-$6700$HQ CPU at $3$.$5$GHz and $16$GB RAM, running Manjaro Linux $21$.$1$.$5$.
We repeated each experiment $100$ times and we report the mean over these iterations.
% \kat{It could be interesting to report also on the diagrams the std}
% \mk{I'll keep it in mind.}
\subsection{Data sets}
\label{subsec:eval-dat}
We performed experiments on real (Section~\ref{subsec:eval-dat-real}) and synthetic data sets (Section~\ref{subsec:eval-dat-syn}).
\subsubsection{Real Data Sets}
\subsubsection{Real data sets}
\label{subsec:eval-dat-real}
For consistency, we sample from each of the following data sets the first $1,000$ entries that satisfy the configuration criteria that we discuss in detail in Section~\ref{subsec:eval-conf}.
\paragraph{Copenhagen}~\cite{sapiezynski2019interaction}
data set was collected via the smartphone devices of $851$ university students over a period of $4$ weeks as part of the Copenhagen Networks Study.
Each device was configured to be discoverable by and to discover nearby Bluetooth devices every $5$ minutes.
Upon discovery, each device registers (i)~the timestamp in seconds, (ii)~the device's unique identifier, (iii)~the unique identifier of the device that it discovered ($- 1$ when no device was found or $- 2$ for any non-participating device), and (iv)~the Received Signal Strength Indicator (RSSI) in dBm.
Half of the devices have registered data for at least $81\%$ of the possible timestamps.
From this data set, we utilized the $1,000$ first contacts out of $12,167$ valid unique contacts of the device with identifier `$449$'. \kat{why only the 1000 first contacts? why device 449? why only one device and not multiple ones, and then report the mean?}
Three devices ($449$, $550$, $689$) satisfy our configuration criteria (Section~\ref{subsec:eval-conf}) within their first $1,000$ entries.
From those $3$ devices, we picked the first one, i.e.,~the device with identifier `$449$', and utilized its first $1,000$ entries out of $12,167$ unique valid contacts.
% \kat{why only the 1000 first contacts? why device 449? why only one device and not multiple ones, and then report the mean?}
% \mk{I explained why 449 and I added a general explanation in the intro of the subsection.}
\paragraph{HUE}~\cite{makonin2018hue}
contains the hourly energy consumption data of $22$ residential customers of BCHydro, a provincial power utility in British Columbia.
The measurements for each residence are saved individually and each measurement contains (i)~the date (YYYY-MM-DD), (ii)~the hour, and (iii)~the energy consumption in kWh.
In our experiments, we used the first $1,000$ out of $29,231$ measurements of the residence with identifier `$1$', average energy consumption equal to $0.88$kWh, and value range $[0.28$, $4.45]$. \kat{again, explain your choices. Moreover, you make some conclusions later on, based on the characteristics of the data set, for example the density of the measurement values. You should describe all these characteristics in these paragraphs.}
In our experiments, we used the first residence, i.e.,~the residence with identifier `$1$', that satisfies our configuration criteria (Section~\ref{subsec:eval-conf}) within its first $1,000$ entries.
In those entries, out of a total of $29,231$ measurements, we estimated an average energy consumption of $0.88$kWh and a value range of $[0.28$, $4.45]$kWh.
% \kat{again, explain your choices. Moreover, you make some conclusions later on, based on the characteristics of the data set, for example the density of the measurement values. You should describe all these characteristics in these paragraphs.}
% \mk{OK}
\paragraph{T-drive}~\cite{yuan2010t}
consists of $15$ million GPS data points of the trajectories of $10,357$ taxis in Beijing, spanning a period of $1$ week and a total distance of $9$ million kilometers.
The taxis reported their location data on average every $177$ seconds and $623$ meters approximately.
Each vehicle registers (i)~the taxi's unique identifier, (ii)~the timestamp (YYYY-MM-DD HH:MM:SS), (iii)~longitude, and (iv)~latitude.
These measurements are stored individually per vehicle.
We sampled the first $1000$ data items of the taxi with identifier `$2$'.\kat{again, explain your choices}
We sampled the first $1,000$ data items of the taxi with identifier `$2$', which satisfied our configuration criteria (Section~\ref{subsec:eval-conf}).
% \kat{again, explain your choices}
% \mk{OK}
\subsubsection{Synthetic data sets}
\label{subsec:eval-dat-syn}
We generated synthetic time series of length equal to $100$ timestamps, for which we varied the number and distribution of {\thethings}.
In this way, we have a controlled data set that we can use to study the behaviour of our proposal.
\kat{more details needed. eg. what is the distributions and number of timestamps used? How many time series you generated? }
We take into account only the temporal order of the points and the position of regular and {\thething} events within the series. \kat{why is the value not important? at the energy consumption, they mattered}
In this way, we have a controlled data set that we can use to study the behavior of our proposal.
% \kat{more details needed. eg. what is the distributions and number of timestamps used? How many time series you generated? }
We take into account only the temporal order of the points and the position of regular and {\thething} events within the time series.
In Section~\ref{subsec:eval-conf}, we explain in more detail our configuration criteria.
% \kat{why is the value not important? at the energy consumption, they mattered}
\subsection{Configurations}
\label{subsec:eval-conf}
% \kat{add some info here.. what are the configurations for? What does landmark percentage refer to, and how does it matter? }
We vary the {\thething} percentage (Section~\ref{subsec:eval-conf-lmdk}), i.e.,~the ratio of timestamps that we attribute to {\thethings} and regular events, in order to explore the behavior of our methodology in all possible scenarios.
For each data set, we implement a privacy mechanism that injects noise related to the type of its attribute values and we tune the parameters of each mechanism accordingly (Section~\ref{subsec:eval-conf-prv}).
Last but not least, we explain how we generate synthetic data sets with various degrees of temporal correlation so as to observe the impact on the overall privacy loss (Section~\ref{subsec:eval-conf-cor}).
\kat{add some info here.. what are the configurations for? What does landmark percentage refer to, and how does it matter? }
\subsubsection{{\Thething} percentage}
In the Copenhagen data set, a landmark represents a time-stamp when a contact device is registered.
We achieve
\label{subsec:eval-conf-lmdk}
In the Copenhagen data set, a {\thething} represents a timestamp when a specific contact device is registered.
After identifying the unique contacts within the sample, we achieve each desired {\thethings} to regular events ratio by considering a list that contains a part of these contact devices.
In more detail, we achieve
$0\%$ {\thethings} by considering an empty list of contact devices,
$20\%$ by extending the list with $[3$, $6$, $11$, $12$, $25$, $29$, $36$, $39$, $41$, $46$, $47$, $50$, $52$, $56$, $57$, $61$, $63$, $78$, $80]$,
$40\%$ with $[81$, $88$, $90$, $97$, $101$, $128$, $130$, $131$, $137$, $145$, $146$, $148$, $151$, $158$, $166$, $175$, $176]$,
$60\%$ with $[181$, $182$, $192$, $195$, $196$, $201$, $203$, $207$, $221$, $230$, $235$, $237$, $239$, $241$, $254]$,
$80\%$ with $[260$, $282$, $287$, $289$, $290$, $291$, $308$, $311$, $318$, $323$, $324$, $330$, $334$, $335$, $344$, $350$, $353$, $355$, $357$, $358$, $361$, $363]$, and
$100\%$ by including all of the possible contacts.
\kat{How did you decide which devices to add at each point?}
% \kat{How did you decide which devices to add at each point?}
% \mk{I discussed it earlier.}
\kat{Say what time-stamps are landmarks in this data set. What is the consumption threshld?}In HUE, we get $0$\%, $20$\% $40$\%, $60$\%, $80$\%, and $100$\% {\thethings} by setting the energy consumption threshold below $0.28$kWh, $1.12$kWh, $0.88$kWh, $0.68$kWh, $0.54$kWh, $4.45$kWh respectively.
% \kat{Say what time-stamps are landmarks in this data set. What is the consumption threshld?}
% \mk{OK}
In HUE, we consider as {\thethings} the events that have energy consumption values below a certain threshold.
That is, we get $0$\%, $20$\%, $40$\%, $60$\%, $80$\%, and $100$\% {\thethings} by setting the energy consumption threshold at $0.28$kWh, $1.12$kWh, $0.88$kWh, $0.68$kWh, $0.54$kWh, and $4.45$kWh respectively.
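For illustration, the following simplified Python sketch (with hypothetical function and variable names) summarizes how the {\thething} events are flagged in the Copenhagen and HUE samples:
\begin{verbatim}
def copenhagen_landmarks(contacts, landmark_devices):
    '''Flag an event as a landmark when its contact device is in the
    configured list of significant contact devices.'''
    return [device in landmark_devices for _, device in contacts]

def hue_landmarks(consumption, threshold):
    '''Flag an event as a landmark when its energy consumption value
    lies below the configured threshold.'''
    return [value < threshold for value in consumption]
\end{verbatim}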
In T-drive, a landmark represents the time-stamp of a stay point. We achieved the desired {\thething} percentages by utilizing the method of Li et al.~\cite{li2008mining} for detecting stay points in trajectory data.
In T-drive, a {\thething} represents a location where a vehicle spent some time.
We achieved the desired {\thething} percentages by utilizing the method of Li et al.~\cite{li2008mining} for detecting stay points in trajectory data.
In more detail, the algorithm checks for each data item if each subsequent item is within a given distance threshold $\Delta l$ and measures the time period $\Delta t$ between the present point and the last subsequent point.
We achieve $0$\%, $20$\% $40$\%, $60$\%, $80$\%, and $100$\% {\thethings} by setting the ($\Delta l$ in meters, $\Delta t$ in minutes) pairs input to the stay point discovery method as [($0$, $1000$), ($2095$, $30$), ($2790$, $30$), ($3590$, $30$), ($4825$, $30$), ($10350$, $30$)]. \kat{how did you come up with these numbers?}
After analyzing the data and experimenting with different pairs of distance and time period, we achieve $0$\%, $20$\%, $40$\%, $60$\%, $80$\%, and $100$\% {\thethings} by setting the ($\Delta l$ in meters, $\Delta t$ in minutes) pairs input to the stay point discovery method as [($0$, $1000$), ($2095$, $30$), ($2790$, $30$), ($3590$, $30$), ($4825$, $30$), ($10350$, $30$)].
% \kat{how did you come up with these numbers?}
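The following Python sketch outlines the stay point detection under simplifying assumptions; the function names and the haversine distance computation are illustrative and may differ from the exact implementation of Li et al.~\cite{li2008mining}:
\begin{verbatim}
import math

def stay_points(points, dl, dt):
    '''
    points: time-ordered list of (timestamp, latitude, longitude) tuples.
    dl:     distance threshold (in meters).
    dt:     time threshold (in the unit of the timestamps).
    Returns the indices of the points where a stay begins.
    '''
    def meters(p, q):
        # Haversine distance between two points, in meters.
        f1, f2 = math.radians(p[1]), math.radians(q[1])
        a = (math.sin((f2 - f1) / 2)**2
             + math.cos(f1) * math.cos(f2)
             * math.sin(math.radians(q[2] - p[2]) / 2)**2)
        return 2 * 6371000 * math.asin(math.sqrt(a))

    stays, i = [], 0
    while i < len(points):
        j = i + 1
        # Extend the candidate stay while points remain within dl meters.
        while j < len(points) and meters(points[i], points[j]) <= dl:
            j += 1
        # Keep the stay if the time spent within the radius is at least dt.
        if points[j - 1][0] - points[i][0] >= dt:
            stays.append(i)
        i = j
    return stays
\end{verbatim}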
We generated synthetic data with \emph{skewed} (the {\thethings} are distributed towards the beginning/end of the series), \emph{symmetric} (in the middle), \emph{bimodal} (both end and beginning), and \emph{uniform} (all over the time series) {\thething} distributions.
In order to get {\thethings} with the above distribution features, we generate probability distributions with appropriate characteristics and sample from them, without replacement, the desired number of points.
%The generated distributions are representative of the cases that we wish to examine during the experiments.
For a left-skewed {\thething} distribution we would utilize a truncated distribution resulting from the restriction of the domain of a distribution to the beginning and end of the time series with its location shifted to the center of the right half of the series.
For a right-skewed ....
For a symmetric ..
For a bimodal ..
For uniform ...
\kat{repeat for all kinds of distributions}
For consistency, we calculate the scale parameter depending on the length of the series by setting it equal to the series' length over a constant.
\kat{The following paragraph does not belong in this section..}
Notice that in our experiments, in the cases when we have $0\%$ and $100\%$ of the events being {\thethings}, we get the same behavior as in event- and user-level privacy respectively.
This happens due the fact that at each timestamp we take into account only the data items at the current timestamp and ignore the rest of the time series (event-level) when there are no {\thethings}.
Whereas, when each timestamp corresponds to a {\thething} we consider and protect all the events throughout the entire series (user-level).
In order to get {\thething} sets with the above distribution features, we generate probability distributions with their domain restricted to the extent of the time series, and sample from them, without replacement, the desired number of points.
For each case, we place the location, i.e.,~centre, of the distribution accordingly.
That is, for symmetric we put the location in the middle of the time series and for left/right skewed to the right/left.
For bimodal we combine two mirrored skewed distributions.
Finally, for the uniform distribution we distribute the {\thethings} randomly throughout the time series.
For consistency, we calculate the scale parameter of the corresponding distribution depending on the length of the time series by setting it equal to the series' length over a constant.
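An indicative Python sketch of this sampling process follows; the choice of the normal distribution and the exact location parameters are simplifying assumptions:
\begin{verbatim}
import numpy as np
from scipy import stats

def generate_landmarks(n, k, shape='symmetric', c=10):
    '''Sample k landmark positions out of a series of n timestamps.'''
    t = np.arange(n)
    scale = n / c                      # scale = series length over a constant
    if shape == 'uniform':
        weights = np.ones(n)
    elif shape == 'symmetric':         # location in the middle
        weights = stats.norm.pdf(t, loc=n / 2, scale=scale)
    elif shape == 'left':              # left-skewed: mass towards the end
        weights = stats.norm.pdf(t, loc=3 * n / 4, scale=scale)
    elif shape == 'right':             # right-skewed: mass towards the start
        weights = stats.norm.pdf(t, loc=n / 4, scale=scale)
    else:                              # bimodal: two mirrored modes
        weights = (stats.norm.pdf(t, loc=n / 4, scale=scale)
                   + stats.norm.pdf(t, loc=3 * n / 4, scale=scale))
    weights = weights / weights.sum()
    # Sample the landmark positions without replacement.
    return np.sort(np.random.choice(t, size=k, replace=False, p=weights))
\end{verbatim}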
\subsubsection{Privacy parameters}
\label{subsec:eval-conf-prv}
% \kat{Explain why you select each of these perturbation mechanisms for each of the datasets. Is the random response differential private? Mention it! }
For all of the real data sets, we implement $\varepsilon$-differential privacy by selecting a mechanism, from those that we described in Section~\ref{subsec:prv-mech}, that is best suited for the type of its sensitive attributes.
To perturb the contact tracing data of the Copenhagen data set, we utilize the \emph{random response} technique~\cite{wang2017locally}, and at each timestamp we report truthfully, with probability $p = \frac{e^\varepsilon}{e^\varepsilon + 1}$, whether the current contact is a {\thething} or not.
We randomize the energy consumption in HUE with the Laplace mechanism~\cite{dwork2014algorithmic}.
For T-drive, we perturb the location data with noise that we sample from the Planar Laplace mechanism~\cite{andres2013geo}.
\kat{Explain why you select each of these perturbation mechanisms for each of the datasets. Is the random response differential private? Mention it! }
To perturb the contact tracing data of the Copenhagen data set, we utilize the \emph{random response} technique~\cite{wang2017locally} to report with probability $p = \frac{e^\varepsilon}{e^\varepsilon + 1}$ whether the current contact is a {\thething} or not.
We randomize the energy consumption in HUE with the Laplace mechanism (described in detail in Section~\ref{subsec:prv-mech}).
We inject noise to the spatial values in T-drive that we sample from the Planar Laplace mechanism~\cite{andres2013geo}.
We set the privacy budget $\varepsilon = 1$, and, for simplicity, we assume that for every query sensitivity it holds that $\Delta f = 1$. \kat{why don't you consider other values as well?}
For the experiments performed on the synthetic data sets, the original values to be released do not influence the outcome of our conclusions, thus we ignore them.
\kat{why are the values not important for the synthetic dataset? This seems a little weird, when said out of context.. our goal is to perturb the values, but do not really care about the way we perturb our values?}
We set the privacy budget $\varepsilon = 1$ for all of our experiments and, for simplicity, we assume that for every query sensitivity it holds that $\Delta f = 1$.
% \kat{why don't you consider other values as well?}
For the experiments that we performed on the synthetic data sets, the original values to be released are not relevant to what we want to observe, and thus we ignore them.
% \kat{why are the values not important for the synthetic dataset? This seems a little weird, when said out of context.. our goal is to perturb the values, but do not really care about the way we perturb our values?}
% Finally, notice that, depending on the results' variation, most diagrams are in logarithmic scale.
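For reference, a minimal Python sketch of the three perturbation mechanisms follows; the Planar Laplace function assumes planar coordinates (e.g., in meters) and is illustrative only:
\begin{verbatim}
import numpy as np
from scipy.special import lambertw

def randomized_response(is_landmark, epsilon):
    '''Report the true value with probability p = e^eps / (e^eps + 1).'''
    p = np.exp(epsilon) / (np.exp(epsilon) + 1)
    return is_landmark if np.random.random() < p else not is_landmark

def laplace(value, epsilon, sensitivity=1.0):
    '''Add zero-mean Laplace noise with scale sensitivity / epsilon.'''
    return value + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

def planar_laplace(x, y, epsilon):
    '''Perturb a 2D point: uniform angle, radius drawn by inverting the
    polar Laplace CDF with the Lambert W function (branch -1).'''
    theta = np.random.uniform(0, 2 * np.pi)
    u = np.random.random()
    r = -(np.real(lambertw((u - 1) / np.e, k=-1)) + 1) / epsilon
    return x + r * np.cos(theta), y + r * np.sin(theta)
\end{verbatim}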
\subsubsection{Temporal correlation}
\kat{Did you find any correlation in the other data? Do you need the correlation matrix to be known a priori? Describe a little why you did not use the real data for correlations }
\label{subsec:eval-conf-cor}
% \kat{Did you find any correlation in the other data? Do you need the correlation matrix to be known a priori? Describe a little why you did not use the real data for correlations }
Despite the inherent presence of temporal correlation in time series, it is challenging to correctly discover and quantify it.
For this reason, and in order to create a more controlled environment for our experiments, we chose to conduct tests relevant to temporal correlation using synthetic data sets.
We model the temporal correlation in the synthetic data as a \emph{stochastic matrix} $P$, using a \emph{Markov Chain}~\cite{gagniuc2017markov}.
$P$ is a $n \times n$ matrix, where the element $P_{ij}$
$P$ is an $n \times n$ matrix, where the element $P_{ij}$
%at the $i$th row of the $j$th column that
represents the transition probability from a state $i$ to another state $j$.
%, $\forall i, j \leq n$.
represents the transition probability from a state $i$ to another state $j$, $\forall$ $i$, $j$ $\leq$ $n$.
It holds that the elements of every row $j$ of $P$ sum up to $1$.
We follow the \emph{Laplacian smoothing} technique~\cite{sorkine2004laplacian} as utilized in~\cite{cao2018quantifying} to generate the matrix $P$ with a degree of temporal correlation $s > 0$ equal to
We follow the \emph{Laplacian smoothing} technique~\cite{sorkine2004laplacian}, as utilized in~\cite{cao2018quantifying}, to generate the matrix $P$ with a degree of temporal correlation $s > 0$ equal to
% and generate a stochastic matrix $P$ with a degree of temporal correlation $s$ by calculating each element $P_{ij}$ as follows
$$\frac{(I_{n})_{ij} + s}{\sum_{k = 1}^{n}((I_{n})_{jk} + s)}$$
where $I_{n}$ is an \emph{identity matrix} of size $n$.
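The following short Python sketch reproduces this construction; note that, since every row of $I_{n}$ contains a single $1$, each denominator equals $1 + ns$, so as $s$ approaches $0$ the matrix $P$ approaches $I_{n}$, whereas for large $s$ every row approaches the uniform distribution:
\begin{verbatim}
import numpy as np

def correlation_matrix(n, s):
    '''Generate an n x n stochastic matrix via Laplacian smoothing with s > 0.
    Every element equals (I_ij + s) / (1 + n * s), so each row sums to 1.'''
    smoothed = np.eye(n) + s
    return smoothed / smoothed.sum(axis=1, keepdims=True)
\end{verbatim}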

View File

@ -1,10 +1,9 @@
\chapter{Evaluation}
\label{ch:eval}
In this chapter we present the experiments that we performed in order to evaluate {\thething} Privacy (Chapter~\ref{ch:lmdk-prv}) on real and synthetic data sets.
In this chapter we present the experiments that we performed in order to evaluate {\thething} privacy (Chapter~\ref{ch:lmdk-prv}) on real and synthetic data sets.
Section~\ref{sec:eval-dtl} contains all the details regarding the data sets that we used for our experiments along with the system configurations.
Section~\ref{sec:eval-lmdk} evaluates the data utility of the {\thething} privacy mechanisms that we designed in Section~\ref{sec:thething} and investigates the behavior of the privacy loss under temporal correlation for different distributions of {\thethings}.
Section~\ref{sec:eval-lmdk-sel} justifies our decisions while designing the privacy-preserving {\thething} selection component in Section~\ref{sec:theotherthing} and the data utility impact of the latter.
Section~\ref{sec:eval-lmdk-sel} justifies our decisions while designing the privacy-preserving {\thething} selection mechanism in Section~\ref{sec:theotherthing} and the data utility impact of the latter.
Finally, Section~\ref{sec:eval-sum} concludes this chapter by summarizing the main results derived from the experiments.
\input{evaluation/details}

View File

@ -1,9 +1,10 @@
\section{Summary}
\label{sec:eval-sum}
In this chapter we presented the experimental evaluation of the {\thething} privacy mechanisms and the privacy-preserving {\thething} selection mechanism that we developed in Chapter~\ref{ch:lmdk-prv}, on real and synthetic data sets.
The Adaptive mechanism is the most reliable and best performing mechanism, in terms of overall data utility, with minimal tuning across most cases.
In this chapter we presented the experimental evaluation of the {\thething} privacy schemes and the privacy-preserving {\thething} selection scheme that we developed in Chapter~\ref{ch:lmdk-prv}, on real and synthetic data sets.
The Adaptive scheme is the most reliable and best performing scheme, in terms of overall data utility, with minimal tuning across most of the cases.
Skip performs optimally in data sets with a smaller target value range, where approximation fits best.
The {\thething} selection component introduces a reasonable data utility decline to all of our mechanisms however, the Adaptive handles it well and bounds the data utility to higher levels compared to user-level protection.\kat{it would be nice to see it clearly on Figure 5.5. (eg, by including another bar that shows adaptive without landmark selection)}
The {\thething} selection module introduces a reasonable data utility decline to all of our schemes; however, Adaptive handles it well and bounds the data utility to higher levels compared to user-level protection.
% \kat{it would be nice to see it clearly on Figure 5.5. (eg, by including another bar that shows adaptive without landmark selection)}
% \mk{Done.}
In terms of temporal correlation, we observe that under moderate and strong temporal correlation, a greater average regular--{\thething} event distance in a {\thething} distribution causes greater overall privacy loss.
Finally, the contribution of the {\thething} privacy on enhancing the data quality, while preserving $\epsilon$-differential privacy is demonstrated by the fact that the selected, Adaptive mechanism provides better data quality than the user-level mechanism.
Finally, the contribution of {\thething} privacy to enhancing the data utility, while preserving $\varepsilon$-differential privacy, is demonstrated by the fact that the selected Adaptive scheme provides better data utility than the user-level privacy protection.

View File

@ -1,25 +1,26 @@
\section{Selection of landmarks}
\section{Selection of {\thethings}}
\label{sec:eval-lmdk-sel}
In this section, we present the experiments on the methodology for the {\thethings} selection presented in Section~\ref{subsec:lmdk-sel-sol}, on the real and the synthetic data sets.
With the experiments on the synthetic data sets (Section~\ref{subsec:sel-utl}) we show the normalized Euclidean and Wasserstein distances \kat{is this distance the landmark distance that we saw just before ? clarify } of the time series histograms for various distributions and {\thething} percentages.
In this section, we present the experiments on the methodology for the {\thething} selection presented in Section~\ref{subsec:lmdk-sel-sol}, on the real and synthetic data sets.
With the experiments on the synthetic data sets (Section~\ref{subsec:sel-utl}) we show the normalized Euclidean and Wasserstein distance metrics (not to be confused with the temporal distances in Figure~\ref{fig:avg-dist})
% \kat{is this distance the landmark distance that we saw just before ? clarify }
of the time series histograms for various distributions and {\thething} percentages.
This allows us to justify our design decisions for our concept that we showcased in Section~\ref{subsec:lmdk-sel-sol}.
With the experiments on the real data sets (Section~\ref{subsec:sel-prv}), we show the performance in terms of utility of our three {\thething} mechanisms in combination with the privacy preserving {\thething} selection component.
\kat{Mention whether it improves the original proposal or not.}
With the experiments on the real data sets (Section~\ref{subsec:sel-prv}), we show the performance in terms of utility of our three {\thething} mechanisms in combination with the privacy-preserving {\thething} selection mechanism, which enhances the privacy protection that our concept provides.
% \kat{Mention whether it improves the original proposal or not.}
\subsection{{\Thething} selection utility metrics}
\label{subsec:sel-utl}
Figure~\ref{fig:sel-dist} demonstrates the normalized distance that we obtain when we utilize either (a)~the Euclidean or (b)~the Wasserstein distance metric to generate a {\thething} set that also includes regular events.
\begin{figure}[htp]
\centering
\subcaptionbox{Euclidean\label{fig:sel-dist-norm}}{%
\includegraphics[width=.5\linewidth]{evaluation/sel-dist-norm}%
\includegraphics[width=.49\linewidth]{evaluation/sel-dist-norm}%
}%
\hfill
\subcaptionbox{Wasserstein\label{fig:sel-dist-emd}}{%
\includegraphics[width=.5\linewidth]{evaluation/sel-dist-emd}%
\includegraphics[width=.49\linewidth]{evaluation/sel-dist-emd}%
}%
\caption{The normalized (a)~Euclidean, and (b)~Wasserstein distance of the generated {\thething} sets for different {\thething} percentages.}
\label{fig:sel-dist}
@ -29,67 +30,71 @@ Comparing the results of the Euclidean distance in Figure~\ref{fig:sel-dist-norm
% (1 + (0.25 + 0.25 + 0.45 + 0.45)/4 + (0.25 + 0.25 + 0.3 + 0.3)/4 + (0.2 + 0.2 + 0.2 + 0.2)/4 + (0.15 + 0.15 + 0.15 + 0.15)/4)/6
% (1 + (0.1 + 0.1 + 0.25 + 0.25)/4 + (0.075 + 0.075 + .15 + 0.15)/4 + (0.075 + 0.075 + 0.1 + 0.1)/4 + (0.025 + 0.025 + 0.025 + 0.025)/4)/6
The maximum difference per {\thething} percentage is approximately $0.2$ for the former and $0.15$ for the latter between the bimodal and skewed {\thething} distributions.
Overall, the Euclidean achieves a mean normalized distance of $0.3$ and the Wasserstein $0.2$.
Therefore, and by observing Figure~\ref{fig:sel-dist}, the Wasserstein distance demonstrates a less consistent performance and less linear behavior among all possible {\thething} distributions.
Thus, we choose to utilize the Euclidean distance metric for the implementation of the privacy-preserving {\thething} selection in Section~\ref{subsec:lmdk-sel-sol}.
Overall, the Euclidean distance achieves a mean normalized distance of $0.3$, while the Wasserstein distance achieves $0.2$.
Therefore, and by observing Figure~\ref{fig:sel-dist}, Wasserstein demonstrates a less consistent performance and less linear behavior among all possible {\thething} distributions.
Thus, we choose to utilize the Euclidean distance metric for the implementation of the privacy-preserving {\thething} selection mechanism in Section~\ref{subsec:lmdk-sel-sol}.
\subsection{Privacy budget tuning}
\label{subsec:sel-eps}
In Figure~\ref{fig:sel-eps} we test the Uniform model in real data by investing different ratios ($1$\%, $10$\%, $25$\%, and $50$\%) of the available privacy budget $\varepsilon$ in the {\thething} selection mechanism, in order to figure out the optimal ratio value.
In Figure~\ref{fig:sel-eps} we test the Uniform mechanism on real data by investing different ratios ($1$\%, $10$\%, $25$\%, and $50$\%) of the available privacy budget $\varepsilon$ in the {\thething} selection mechanism and the remainder in perturbing the original data values, in order to figure out the optimal ratio value.
Uniform is our baseline implementation, and hence allows us to derive more accurate conclusions in this case.
In general, greater ratios will result in more accurate, i.e.,~smaller, {\thething} sets and less accurate values in the released data sets.
In general, we are expecting to observe that greater ratios will result in more accurate, i.e.,~smaller, {\thething} sets and less accurate values in the released data.
\begin{figure}[htp]
\centering
\subcaptionbox{Copenhagen\label{fig:copenhagen-sel-eps}}{%
\includegraphics[width=.5\linewidth]{evaluation/copenhagen-sel-eps}%
\includegraphics[width=.49\linewidth]{evaluation/copenhagen-sel-eps}%
}%
\hspace{\fill}
\\ \bigskip
\subcaptionbox{HUE\label{fig:hue-sel-eps}}{%
\includegraphics[width=.5\linewidth]{evaluation/hue-sel-eps}%
\includegraphics[width=.49\linewidth]{evaluation/hue-sel-eps}%
}%
\hfill
\subcaptionbox{T-drive\label{fig:t-drive-sel-eps}}{%
\includegraphics[width=.5\linewidth]{evaluation/t-drive-sel-eps}%
\includegraphics[width=.49\linewidth]{evaluation/t-drive-sel-eps}%
}%
\caption{The mean absolute error (a)~as a percentage, (b)~in kWh, and (c)~in meters of the released data for different {\thething} percentages. We apply the Uniform {\thething} privacy model and vary the ratio of the privacy budget $\varepsilon$ that we allocate to the {\thething} selection component.}
\caption{The mean absolute error (a)~as a percentage, (b)~in kWh, and (c)~in meters of the released data for different {\thething} percentages. We apply the Uniform {\thething} privacy mechanism and vary the ratio of the privacy budget $\varepsilon$ that we allocate to the {\thething} selection mechanism.}
\label{fig:sel-eps}
\end{figure}
The application of the randomized response mechanism, in the Copenhagen data set, is tolerant to the fluctuations of the privacy budget.
For HUE and T-drive, we observe that our implementation performs better for lower ratios, e.g.,~$0.01$, where we end up allocating the majority of the available privacy budget to the data release process instead of the {\thething} selection component.
The results of this experiment indicate that we can safely allocate the majority of $\varepsilon$ for publishing the data values, and therefore achieve better data utility, while providing more robust privacy protection to the {\thething} timestamp set.
The application of the randomized response mechanism, in the Copenhagen data set (Figure~\ref{fig:copenhagen-sel-eps}), is tolerant to the fluctuations of the privacy budget and maintains a relatively constant performance in terms of data utility.
For HUE (Figure~\ref{fig:hue-sel-eps}) and T-drive (Figure~\ref{fig:t-drive-sel-eps}), we observe that our implementation performs better for lower ratios, e.g.,~$0.01$, where we end up allocating the majority of the available privacy budget to the data release process instead of the {\thething} selection mechanism.
The results of this experiment indicate that we can safely allocate the majority of $\varepsilon$ for publishing the data values, and therefore achieve better data utility, while providing more robust privacy protection to the {\thething} set.
\subsection{Budget allocation and {\thething} selection}
\label{subsec:sel-prv}
Figure~\ref{fig:real-sel} exhibits the performance of Skip, Uniform, and Adaptive (see Section~\ref{subsec:lmdk-mechs}) in combination with the {\thething} selection component.
Figure~\ref{fig:real-sel} exhibits the performance of Skip, Uniform, and Adaptive mechanisms (presented in detail in Section~\ref{subsec:lmdk-mechs}) in combination with the {\thething} selection mechanism (Section~\ref{subsec:lmdk-sel-sol}).
\begin{figure}[htp]
\centering
\subcaptionbox{Copenhagen\label{fig:copenhagen-sel}}{%
\includegraphics[width=.5\linewidth]{evaluation/copenhagen-sel}%
\includegraphics[width=.49\linewidth]{evaluation/copenhagen-sel}%
}%
\hspace{\fill}
\hfill
\\ \bigskip
\subcaptionbox{HUE\label{fig:hue-sel}}{%
\includegraphics[width=.5\linewidth]{evaluation/hue-sel}%
\includegraphics[width=.49\linewidth]{evaluation/hue-sel}%
}%
\hfill
\subcaptionbox{T-drive\label{fig:t-drive-sel}}{%
\includegraphics[width=.5\linewidth]{evaluation/t-drive-sel}%
\includegraphics[width=.49\linewidth]{evaluation/t-drive-sel}%
}%
\caption{The mean absolute error (a)~as a percentage, (b)~in kWh, and (c)~in meters of the released data for different {\thething} percentages.}
\caption{
The mean absolute error (a)~as a percentage, (b)~in kWh, and (c)~in meters of the released data, for different {\thething} percentages, with the incorporation of the privacy-preserving {\thething} selection mechanism.
The light short horizontal lines indicate the corresponding measurements from Figure~\ref{fig:real} without the {\thething} selection mechanism.
}
\label{fig:real-sel}
\end{figure}
In comparison with the utility performance without the {\thething} selection component (Figure~\ref{fig:real}), we notice a slight deterioration for all three models.
This is natural since we allocated part of the available privacy budget to the privacy-preserving {\thething} selection component, which in turn increased the number of {\thethings}.
Therefore, there is less privacy budget available for data publishing throughout the time series for $0$\% and $100$\% {\thethings}.
\kat{why not for the other percentages?}
Skip performs best in our experiments with HUE, due to the low range in the energy consumption and the high scale of the Laplace noise, which it avoids due to the employed approximation.
However, for the Copenhagen data set and T-drive Skip attains greater mean absolute error than the user-level protection scheme, which exposes no benefit w.r.t. user-level protection.
Overall, Adaptive has a consistent performance in terms of utility for all of the data sets that we experimented with, and always outperforms the user-level privacy.
In comparison with the utility performance without the {\thething} selection mechanism (light short horizontal lines), we notice a slight deterioration for all three mechanisms.
This is natural since we allocated part of the available privacy budget to the privacy-preserving {\thething} selection mechanism, which in turn increased the number of {\thethings}, except for the case of $100$\% {\thethings}.
Therefore, there is less privacy budget available for data publishing throughout the time series.
% for $0$\% and $100$\% {\thethings}.
% \kat{why not for the other percentages?}
Skip performs best in our experiments with HUE (Figure~\ref{fig:hue-sel}), owing to the low range of the energy consumption values and the high scale of the Laplace noise, which it avoids thanks to the employed approximation.
However, for the Copenhagen data set (Figure~\ref{fig:copenhagen-sel}) and T-drive (Figure~\ref{fig:t-drive-sel}), Skip attains high mean absolute error, which exposes no benefit with respect to user-level protection.
Overall, Adaptive has a consistent performance in terms of utility for all of the data sets that we experimented with, and almost always outperforms the user-level privacy protection.
Thus, it is selected as the best mechanism to use in general.

View File

@ -1,74 +1,98 @@
\section{Landmark events}
\section{{\Thething} events}
\label{sec:eval-lmdk}
% \kat{After discussing with Dimitris, I thought you are keeping one chapter for the proposals of the thesis. In this case, it would be more clean to keep the theoretical contributions in one chapter and the evaluation in a separate chapter. }
% \mk{OK.}
In this section, we present the experiments that we performed to test the methodology that we presented in Section~\ref{subsec:lmdk-sol} on real and synthetic data sets.
With the experiments on the real data sets (Section~\ref{subsec:lmdk-expt-bgt}), we show the performance in terms of data utility of our three {\thething} privacy budget allocation schemes: Skip, Uniform and Adaptive.
We define data utility as the Mean Absolute Error introduced by the privacy mechanism.
We compare with the event and user differential privacy, and show that in the general case, {\thething} privacy allows for better data utility than user differential privacy.
With the experiments on the synthetic data sets (Section~\ref{subsec:lmdk-expt-cor}) we show the privacy loss \kat{in the previous set of experiments we were measuring the MAE, now we are measuring the privacy loss... Why is that? Isn't it two sides of the same coin? }by our framework when tuning the size and statistical characteristics of the input {\thething} set $L$ with special emphasis on how the privacy loss under temporal correlation is affected by the number and distribution of the {\thethings}.
\kat{mention briefly what you observe}
With the experiments on the real data sets (Section~\ref{subsec:lmdk-expt-bgt}), we show the performance in terms of data utility of our three {\thething} privacy mechanisms: Skip, Uniform and Adaptive.
We define data utility as the mean absolute error introduced by the privacy mechanism.
We compare with the event- and user-level differential privacy protection levels, and show that, in the general case, {\thething} privacy allows for better data utility than user-level differential privacy while balancing between the two protection levels.
With the experiments on the synthetic data sets (Section~\ref{subsec:lmdk-expt-cor}) we show the overall privacy loss,
% \kat{in the previous set of experiments we were measuring the MAE, now we are measuring the privacy loss... Why is that? Isn't it two sides of the same coin? }
i.e.,~the privacy budget $\varepsilon$ along with the additional privacy loss caused by temporal correlation, within our framework when tuning the size and statistical characteristics of the input {\thething} set $L$.
% \kat{mention briefly what you observe}
We observe that a greater average {\thething}--regular event distance in a time series can result into greater overall privacy loss under moderate and strong temporal correlation.
\subsection{Budget allocation schemes}
\label{subsec:lmdk-expt-bgt}
Figure~\ref{fig:real} exhibits the performance of the three mechanisms, Skip, Uniform, and Adaptive, applied to the three data sets that we study.
Notice that, in the cases when we have $0\%$ and $100\%$ of the events being {\thethings}, we get the same behavior as in event- and user-level privacy respectively.
This happens due to the fact that at each timestamp we take into account only the data items at the current timestamp and ignore the rest of the time series (event-level) when there are no {\thethings}.
Whereas, when each timestamp corresponds to a {\thething} we consider and protect all the events throughout the entire series (user-level).
% For the Geolife data set (Figure~\ref{fig:geolife}), Skip has the best performance (measured in Mean Absolute Error, in meters) because it invests the most budget overall at every regular event, by approximating the {\thething} data based on previous releases.
% Due to the data set's high density (every $1$--$5$ seconds or every $5$--$10$ meters per point) approximating constantly has a low impact on the data utility.
% On the contrary, the lower density of the T-drive data set (Figure~\ref{fig:t-drive}) has a negative impact on the performance of Skip.
For the Copenhagen data set (Figure~\ref{fig:copenhagen}), Adaptive has a constant\kat{it is not constant, for 0 it is much lower} overall performance and performs best for $0$\%, $60$\%, and $80$\% {\thethings} \kat{this is contradictory: you say that it is constant overall, and then that it is better for certain percentages. }.
We notice that for $0$\% {\thethings}, it achieves better utility than the event-level protection.\kat{what does this mean? how is it possible?}
The Skip model excels, compared to the others, at cases where it needs to approximate $20$\%--$40$\% or $100$\% of the times.\kat{it seems a little random.. do you have an explanation? (rather few times or all?)}
The combination of the small range of measurements in HUE ($[0.28$, $4.45]$ with an average of $0.88$kWh) and the large scale in the Laplace mechanism, results in a low mean absolute error for Skip (Figure~\ref{fig:hue}).
In general, a scheme that favors approximation over noise injection would achieve a better performance in this case.
\kat{why?explain}
However, the Adaptive model performs by far better than Uniform and strikes a nice balance\kat{???} between event- and user-level protection for all {\thething} percentages.
In the T-drive data set (Figure~\ref{fig:t-drive}), the Adaptive mechanism outperforms Uniform by $10$\%--$20$\% for all {\thething} percentages greater than $40$\% and Skip by more than $20$\%.
The lower density (average distance of $623$m) of the T-drive data set has a negative impact on the performance of Skip; republishing a previous perturbed value is now less accurate than perturbing the new location.
\begin{figure}[htp]
\centering
\subcaptionbox{Copenhagen\label{fig:copenhagen}}{%
\includegraphics[width=.5\linewidth]{evaluation/copenhagen}%
\includegraphics[width=.49\linewidth]{evaluation/copenhagen}%
}%
\hspace{\fill}
\\ \bigskip
\subcaptionbox{HUE\label{fig:hue}}{%
\includegraphics[width=.5\linewidth]{evaluation/hue}%
\includegraphics[width=.49\linewidth]{evaluation/hue}%
}%
\hfill
\subcaptionbox{T-drive\label{fig:t-drive}}{%
\includegraphics[width=.5\linewidth]{evaluation/t-drive}%
\includegraphics[width=.49\linewidth]{evaluation/t-drive}%
}%
\caption{The mean absolute error (a)~as a percentage, (b)~in kWh, and (c)~in meters of the released data for different {\thething} percentages.}
\label{fig:real}
\end{figure}
In general, we can claim that the Adaptive is the most reliable and best performing mechanism with minimal tuning\kat{what does minimal tuning mean?}, if we take into consideration the drawbacks of the Skip mechanism mentioned in Section~\ref{subsec:lmdk-mechs}. \kat{you can mention them also here briefly, and give the pointer for the section}
Moreover, designing a data-dependent sampling scheme \kat{what would be the main characteristic of the scheme? that it picks landmarks how?} would possibly\kat{possibly is not good enough, if you are sure remove it. Otherwise mention that more experiments need to be done?} result in better results for Adaptive.
For the Copenhagen data set (Figure~\ref{fig:copenhagen}), Adaptive has an
% constant
% \kat{it is not constant, for 0 it is much lower}
overall consistent performance and works best for $60$\% and $80$\% {\thethings}.
% \kat{this is contradictory: you say that it is constant overall, and then that it is better for certain percentages. }.
% \mk{`Consistent' is the right word.}
We notice that for $0$\% {\thethings}, it achieves better utility than the event-level protection
% \kat{what does this mean? how is it possible?}
due to the combination of more available privacy budget per timestamp (due to the absence of {\thethings}) and its adaptive sampling methodology.
Skip excels, compared to the others, in cases where it needs to approximate $20$\%, $40$\%, or $100$\% of the time.
% \kat{it seems a little random.. do you have an explanation? (rather few times or all?)}
In general, we notice that, for this data set and due to the application of the random response technique, it is more beneficial to either invest more privacy budget per event or prefer approximation over introducing randomization.
The combination of the small range of measurements ($[0.28$, $4.45]$ with an average of $0.88$kWh) in HUE (Figure~\ref{fig:hue}) and the large scale in the Laplace mechanism, allows for mechanisms that favor approximation over noise injection to achieve a better performance in terms of data utility.
Hence, Skip achieves a consistently low mean absolute error.
% \kat{why?explain}
Regardless, the Adaptive mechanism performs far better than Uniform and
% strikes a nice balance\kat{???}
balances between event- and user-level protection for all {\thething} percentages.
In T-drive (Figure~\ref{fig:t-drive}), the Adaptive mechanism outperforms Uniform by $10$\%--$20$\% for all {\thething} percentages greater than $40$\% and Skip by more than $20$\%.
The lower density (average distance of $623$m) of the T-drive data set has a negative impact on the performance of Skip because republishing a previously perturbed value is now less accurate than perturbing the current location.
Principally, we can claim that the Adaptive is the most reliable and best performing mechanism,
% with a minimal and generic parameter tuning
% \kat{what does minimal tuning mean?}
if we take into consideration the drawbacks of the Skip mechanism, particularly in spatiotemporal data, e.g., sporadic location data publishing~\cite{gambs2010show, russell2018fitness} or misapplying location cloaking~\cite{xssfopes2020tweet}, which could lead to the disclosure of privacy-sensitive attribute values.
% (mentioned in Section~\ref{subsec:lmdk-mechs})
% \kat{you can mention them also here briefly, and give the pointer for the section}
Moreover, implementing a more advanced and data-dependent sampling method
% \kat{what would be the main characteristic of the scheme? that it picks landmarks how?}
that accounts for changes in the trends of the input data and adapts its rate accordingly, would
% possibly
% \kat{possibly is not good enough, if you are sure remove it. Otherwise mention that more experiments need to be done?}
result in a more effective budget allocation that would improve the performance of Adaptive in terms of data utility.
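To make this intuition more concrete, the following Python sketch illustrates a sample-or-approximate release combined with the {\thething}-aware budget split; the function name, the threshold rule, and the parameter defaults are our own illustrative assumptions and not the implementation evaluated in this chapter.
\begin{verbatim}
import numpy as np

def adaptive_release(series, landmarks, epsilon, threshold=1.0, sensitivity=1.0):
    # Illustrative sketch only: the total budget is divided among the landmarks
    # plus the current regular event, and a fresh perturbation is published only
    # when it deviates noticeably from the previous release.
    eps_event = epsilon / (len(landmarks) + 1)
    last_release, releases = None, []
    for value in series:
        fresh = value + np.random.laplace(scale=sensitivity / eps_event)
        if last_release is None or abs(fresh - last_release) > threshold:
            # Sampling: publish a freshly perturbed value.
            last_release = fresh
        # Otherwise approximate, i.e., republish the previous noisy value.
        releases.append(last_release)
    return releases
\end{verbatim}
A data-dependent variant would additionally adjust the threshold, or the sampling rate itself, according to how quickly the published series changes.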
\subsection{Temporal distance and correlation}
\label{subsec:lmdk-expt-cor}
As previously mentioned, temporal correlation is inherent in continuous publishing, and it is the cause of supplementary privacy loss in the case of privacy-preserving time series publishing.
In this section, we are interested in studying the effect that the distance of the {\thethings} from every regular event has on the loss caused under the presence of temporal correlation.
Figure~\ref{fig:avg-dist} shows a comparison of the average temporal distance of the events from the previous/next {\thething} or the start/end of the time series for various distributions in our synthetic data.
More specifically, we model the distance of an event as the count of the total number of events between itself and the nearest {\thething} or the time series edge.
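A minimal Python sketch of this distance measure follows; the function and variable names are ours and serve illustration only.
\begin{verbatim}
def temporal_distance(index, landmarks, n):
    # Count of the events lying strictly between position `index` and the
    # nearest landmark or the edge of a time series of length n.
    boundaries = sorted(set(landmarks) | {-1, n})  # series edges act as boundaries
    previous = max(b for b in boundaries if b < index)
    following = min(b for b in boundaries if b > index)
    return min(index - previous, following - index) - 1

# E.g., for n = 8 with landmarks at positions {0, 7}, the regular event at
# position 3 has 2 events between itself and its nearest landmark (position 0).
\end{verbatim}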
\begin{figure}[htp]
\centering
\includegraphics[width=.5\linewidth]{evaluation/avg-dist}%
\caption{Average temporal distance of regular events from the {\thethings} for different {\thethings} percentages within a time series in various {\thething} distributions.}
\label{fig:avg-dist}
\end{figure}
We observe that the uniform and bimodal distributions tend to limit the regular event--{\thething} distance.
This is due to the fact that the former scatters the {\thethings}, while the latter distributes them on both edges, leaving a shorter space uninterrupted by {\thethings}.
% and as a result they reduce the uninterrupted space by landmarks in the sequence.
On the contrary, distributing the {\thethings} at one part of the sequence, as in skewed or symmetric, creates a wider space without {\thethings}.
This study provides us with different distance settings that we are going to use in the subsequent overall privacy loss study.
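For reference, one possible way to draw {\thething} positions according to these distributions is sketched below; this is not necessarily the generator used for our synthetic data, and the weighting functions are purely illustrative.
\begin{verbatim}
import numpy as np

def place_landmarks(n, fraction, distribution, seed=0):
    # Illustrative sketch: draw landmark positions over a series of length n.
    rng = np.random.default_rng(seed)
    k = int(round(fraction * n))
    center = np.abs(np.arange(n) - (n - 1) / 2)       # distance from the middle
    weights = {
        'uniform':   np.ones(n),                      # spread everywhere
        'bimodal':   center + 1,                      # mass on both edges
        'symmetric': center.max() - center + 1,       # mass around the middle
        'skewed':    np.linspace(n, 1, n),            # mass towards one edge
    }[distribution]
    positions = rng.choice(n, size=k, replace=False, p=weights / weights.sum())
    return sorted(positions.tolist())
\end{verbatim}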
Figure~\ref{fig:dist-cor} illustrates a comparison among the aforementioned distributions regarding the overall privacy loss under (a)~weak, (b)~moderate, and (c)~strong temporal correlation degrees.
The line shows the overall privacy loss---for all cases of {\thething} distribution---without temporal correlation.
\begin{figure}[htp]
\centering
\subcaptionbox{Weak correlation\label{fig:dist-cor-wk}}{%
\includegraphics[width=.49\linewidth]{evaluation/dist-cor-wk}%
}%
\hfill
\\ \bigskip
\subcaptionbox{Moderate correlation\label{fig:dist-cor-mod}}{%
\includegraphics[width=.49\linewidth]{evaluation/dist-cor-mod}%
}%
\hfill
\subcaptionbox{Strong correlation\label{fig:dist-cor-stg}}{%
\includegraphics[width=.49\linewidth]{evaluation/dist-cor-stg}%
}%
\caption{
The overall privacy loss (privacy budget $\varepsilon$)
for different {\thething} percentages and distributions under (a)~weak, (b)~moderate, and (c)~strong degrees of temporal correlation.
The line shows the overall privacy loss without temporal correlation.
}
\label{fig:dist-cor}
\end{figure}
In combination with Figure~\ref{fig:avg-dist}, we conclude that a greater average {\thething}--regular event distance in a distribution can result in a greater overall privacy loss under moderate and strong temporal correlation.
This is due to the fact that the backward/forward privacy loss accumulates more over time in wider spaces without {\thethings} (see Section~\ref{sec:correlation}).
Furthermore, the behavior of the privacy loss is as expected regarding the temporal correlation degree: a stronger correlation degree generates higher privacy loss while widening the gap between the different distribution cases.
On the contrary, a weaker correlation degree makes it harder to differentiate among the {\thething} distributions.
The privacy loss under a weak correlation degree converges, i.e.,~it is practically the same for all possible distributions and for all {\thething} percentages.
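To see schematically why wider {\thething}-free gaps matter, consider a simplified reading of the accumulation described in Section~\ref{sec:correlation}: if the loss carried over to timestamp $t$ behaves like
\[
  \alpha_t = \varepsilon_t + c \, \alpha_{t - 1}, \qquad c \in [0, 1],
\]
where $c$ abstracts the strength of the temporal correlation, then over a gap of $g$ consecutive regular events the carried-over term keeps growing with $g$, and it grows faster for larger $c$.
This toy recurrence is only meant to convey the trend visible in Figure~\ref{fig:dist-cor}; it does not reproduce the exact quantities of the backward/forward privacy loss.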


@ -1,2 +1,4 @@
\section{Contribution}
\label{sec:contr}
\mk{WIP}


@ -1,12 +1,13 @@
\chapter{Introduction}
\label{ch:intro}
Data privacy is becoming an increasingly important issue, both at a technical and at a societal level, and introduces various challenges ranging from the way we share and publish data sets to the way we use online and mobile services.
Personal information, also described as \emph{microdata}, has acquired increasing value and is in many cases used as the `currency'~\cite{economist2016data} to pay for access to various services, i.e.,~users are asked to exchange their personal information for the provided service.
This is particularly true for many \emph{Location-Based Services} (LBSs), e.g.,~Google Maps~\cite{gmaps}, Waze~\cite{waze}, etc.
These services provide their `free' service in exchange for collecting and using user-generated data, such as timestamped geolocalized information.
Besides navigation and location-based services, social media applications, e.g.,~Facebook~\cite{facebook}, Twitter~\cite{twitter}, Foursquare~\cite{foursquare}, etc., take advantage of user-generated and user-related data to make relevant recommendations and show personalized advertisements.
In this case, the location is also an important part of the personal data required to be shared.
Last but not least, \emph{data brokers}, e.g.,~Experian~\cite{experian}, TransUnion~\cite{transunion}, Acxiom~\cite{acxiom}, etc., collect data from public and private resources, e.g.,~censuses, bank card transaction records, voter registration lists, etc.
Most of these data are georeferenced and contain directly or indirectly location information; protecting the location of the user has become one of the most important privacy goals so far.
These different sources and types of data, on the one hand, give useful feedback to the involved users and/or services, and, on the other hand, when combined, provide valuable information to various internal/external analytical services.
Privacy-preserving processes usually introduce noise in the original or the aggregated data set in order to hide the sensitive information.
In the case of \emph{microdata}, a privacy-protected version, containing some synthetic data as well, is generated with the intrinsic goal to make the users indistinguishable.
In the case of \emph{statistical} data, i.e.,~the results of statistical queries over the original data sets, a privacy-protected version is generated by adding noise to the actual statistical values.
In both cases, we end up affecting the quality of the published data set.
The privacy and the utility of the `noisy' output are two contrasting desiderata which need to be measured and balanced.
Furthermore, if we want to account for external additional information, e.g.,~linked or correlated data, and at the same time to ensure the same level of protection, we need to add additional noise, which inevitably deteriorates the quality of the output.
This problem becomes particularly pertinent in the Big Data era, as the quality or \emph{Veracity} is one of the five dimensions (known as the five \emph{`V's'}) that define Big Data and where there is an abundance of external information that cannot be ignored.
Since this needs to be taken into account \emph{prior} to the publishing of the data set or the aggregated statistics thereof, introducing external information into privacy-preserving techniques becomes part of the traditional processing flow while keeping an acceptable quality-to-privacy ratio.
As we can observe in the examples mentioned above, there are many cases where data are not protected at source (what is also described as \emph{local} data privacy protection) for various reasons, e.g.,~the users do not want to pay extra, it is impossible due to technical complexity, because the quality of the expected service will be deteriorated, etc.
Moreover, privacy-preserving algorithms are designed specifically for data publishing.
In that respect, we need to be able to correctly choose the proper privacy algorithm(s), which would allow users to share protected copies of their data with some guarantees.
The selection process is far from trivial, since it is essential to:
\begin{enumerate}
\itemsep-0.25em
\item select an appropriate privacy-preserving technique, relevant to the data set intended for public release;
\item understand the different requirements imposed by the selected technique and tune the different parameters according to the circumstances of the use case based on, e.g.,~assumptions, level of distortion, etc.~\cite{kifer2011no};
\item get the necessary balance between privacy and data utility, which is a significant task for any privacy algorithm as well as any privacy expert.
\label{fig:data-value}
\end{figure}
In data privacy research, privacy in continuous data publishing scenarios is the area that is concerned by studying the privacy problems created when sensitive data are published continuously, either infinitely, e.g.,~streaming data, or by multiple continuous publications over a known period of time, e.g.,~finite time series data.
This specific subfield of data privacy becomes increasingly important since it:
\begin{enumerate}[(i)]
\itemsep-0.25em
\item includes the most prominent cases, e.g.,~location (trajectory) privacy problems, and
\item provides the most challenging and yet not well charted part of the privacy algorithms since it is rather new and increasingly complex.
\end{enumerate}
In this context, we seek to offer a guide that would allow its users to choose the proper algorithm(s) for their specific use case accordingly.
Additionally, data in continuous data publishing use cases require timely processing because their value usually decreases over time depending on the use case, as demonstrated in Figure~\ref{fig:data-value}.
For this reason, we provide an insight into time-related properties of the algorithms, e.g.,~if they work on infinite, real-time data, or if they take into consideration existing data dependencies.
The importance of continuous data publishing is stressed by the fact that, commonly, many types of data have such properties, with geospatial data being a prominent case.


@ -1,2 +1,37 @@
\section{Structure}
\label{sec:struct}
This thesis is structured as follows:
\paragraph{Chapter~\ref{ch:prel}}
introduces some relevant terminology and information around the problem of
quality and privacy in user-generated Big Data with a special focus on continuous data publishing.
First, in Section~\ref{sec:data}, we categorize user-generated data sets and review data processing in the context of continuous data publishing.
Second, in Section~\ref{sec:privacy}, we define information disclosure in data privacy. We list the categories of privacy attacks, the possible privacy protection levels, the fundamental privacy operations that are applied to achieve data privacy, and finally we provide a brief overview of the basic notions for data privacy protection.
Third, in Section~\ref{sec:correlation}, we focus on the impact of correlation on data privacy.
More particularly, we discuss the different types of correlation, we document ways to extract data correlation from continuous data, and we investigate the privacy risks that data correlation entails with special focus on the privacy loss under temporal correlation.
\paragraph{Chapter~\ref{ch:rel}}
reviews works that deal with privacy under continuous data publishing covering diverse use cases.
We present the relevant literature based on two levels of categorization.
First, we group works with respect to whether they deal with microdata or statistical data as input.
Then, we further group them into two subcategories depending on if they are designed for the finite or infinite observation setting.
\paragraph{Chapter~\ref{ch:lmdk-prv}}
proposes a novel configurable privacy scheme, \emph{{\thething} privacy} (Section~\ref{sec:thething}), which takes into account significant events (\emph{\thethings}) in the time series and allocates the available privacy budget accordingly.
We propose three privacy schemes that guarantee {\thething} privacy.
To further enhance our privacy methodology, and protect the {\thething} position in the time series, we propose techniques to perturb the initial {\thething} set (Section~\ref{sec:theotherthing}).
\paragraph{Chapter~\ref{ch:eval}}
presents the experiments that we performed in order to evaluate {\thething} privacy (Chapter~\ref{ch:lmdk-prv}) on real and synthetic data sets.
Section~\ref{sec:eval-dtl} contains all the details regarding the data sets that we used for our experiments along with the system configurations.
Section~\ref{sec:eval-lmdk} evaluates the data utility of the {\thething} privacy schemes that we designed in Section~\ref{sec:thething} and investigates the behavior of the privacy loss under temporal correlation for different distributions of {\thethings}.
Section~\ref{sec:eval-lmdk-sel} justifies the decisions that we made while designing the privacy-preserving {\thething} selection module in Section~\ref{sec:theotherthing} and evaluates the data utility impact of the latter.
Finally, Section~\ref{sec:eval-sum} concludes this chapter by summarizing the main results derived from the experiments.
\paragraph{Chapter~\ref{ch:con}}
concludes the thesis and outlines possible future directions.


@ -62,7 +62,7 @@
\newcommand{\thetitle}{Quality \& Privacy in User-generated Big Data: Algorithms \& Techniques}
\newcommand{\theyear}{2021}
\newcommand{\thedate}{***** **, \theyear}
\newcommand{\thething}{landmark}
@ -85,6 +85,7 @@
\afterpage{\blankpage}
\input{abstract}
\input{resume}
\input{acknowledgements}
\tableofcontents


The authors propose solutions to bound the temporal privacy loss, under the presence of weak to moderate correlation, in both finite and infinite data publishing scenarios.
In the latter case, they try to find a value for $\varepsilon$ for which the backward and forward privacy loss are equal.
In the former, they similarly try to balance the backward and forward privacy loss while they allocate more $\varepsilon$ at the first and last timestamps, since they have higher impact on the privacy loss of the next and previous ones.
This way they achieve an overall constant temporal privacy loss throughout the time series.
According to the technique's intuition, stronger correlations result in higher privacy loss.


@ -4,17 +4,18 @@
% Crowdsensing applications
The plethora of sensors currently embedded in personal devices and other infrastructures has paved the way for the development of numerous \emph{crowdsensing services} (e.g.,~Ring~\cite{ring}, TousAntiCovid~\cite{tousanticovid}, Waze~\cite{waze}, etc.) based on the collected personal, and usually geotagged and timestamped, data.
% Continuously user-generated data
User--service interactions gather personal event-like data that are data items composed of pairs of an identifying attribute of an individual and the---possibly sensitive---information at a timestamp (including contextual information), e.g.,~(\emph{`Bob', `dining', `Canal Saint-Martin', $17{:}00$}).
%For a reminder, when the interactions are performed in a continuous manner, we obtain time series of events.
% Observation/interaction duration
%Depending on the duration, we distinguish the interaction/observation into finite, when taking place during a predefined time interval, and infinite, when taking place in an uninterrupted fashion.
Example~\ref{ex:scenario} shows the result of user--LBS interaction while retrieving location-based information or reporting user-state at various locations.
\begin{example}
\label{ex:scenario}
Consider a finite sequence of spatiotemporal data generated by Bob during an interval of $8$ timestamps, as shown in Figure~\ref{fig:scenario}.
Events in a shade correspond to the significant events that Bob has defined beforehand.
For instance, $p_1$ and $p_8$ are significant because he was at his home, which is around {\'E}lys{\'e}e, at $p_3$ he was at his workplace around the Louvre, and at $p_5$ he was at his hangout around Canal Saint-Martin.
\begin{figure}[htp]
\centering
A widely recognized tool that introduces probabilistic randomness to the original data is \emph{differential privacy}.
Due to its \emph{composition} property, i.e.,~the combination of differentially private outputs satisfies differential privacy as well, differential privacy is suitable for privacy-preserving time series publishing.
\emph{Event}, \emph{user}~\cite{dwork2010differential, dwork2010pan}, and \emph{$w$-event}~\cite{kellaris2014differentially} comprise the possible levels of privacy protection.
Event-level limits the privacy protection to \emph{any single event}, user-level protects \emph{all the events} of any user, and $w$-event provides privacy protection to \emph{any sequence of $w$ events}.
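For completeness, we recall the standard sequential composition guarantee that underlies this suitability: if mechanisms $\mathcal{M}_1$ and $\mathcal{M}_2$ satisfy $\varepsilon_1$- and $\varepsilon_2$-differential privacy, respectively, then releasing both of their outputs on the same data satisfies $(\varepsilon_1 + \varepsilon_2)$-differential privacy.
Hence, in a time series setting, the per-timestamp privacy budgets add up towards the overall available budget $\varepsilon$.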
In this chapter, we propose a novel configurable privacy scheme, \emph{\thething} privacy (Section~\ref{sec:thething}), which takes into account significant events (\emph{\thethings}) in the time series and allocates the available privacy budget accordingly.
We propose three privacy schemes that guarantee {\thething} privacy.
To further enhance our privacy methodology, and to protect the position of the {\thethings} in the time series, we propose techniques to perturb the initial {\thething} set (Section~\ref{sec:theotherthing}).
\input{problem/thething/main}
\input{problem/theotherthing/main}


@ -3,7 +3,7 @@
In this chapter, we presented \emph{{\thething} privacy} for privacy-preserving time series publishing, which allows for the protection of significant events, while improving the utility of the final result with respect to the traditional user-level differential privacy.
We also proposed three models for {\thething} privacy, and quantified the privacy loss under temporal correlation.
Furthermore, we presented three solutions to enhance our privacy scheme by protecting the actual temporal position of the {\thethings} in the time series.
We defer the experimental evaluation of our methodology to Chapter~\ref{ch:eval}, where we experiment with real and synthetic data sets to demonstrate the applicability of the {\thething} privacy models by themselves (Section~\ref{sec:eval-lmdk}) and in combination with the {\thething} selection component (Section~\ref{sec:eval-lmdk-sel}).
%Our experiments on real and synthetic data sets validate our proposal.


@ -2,5 +2,5 @@
\label{subsec:lmdk-contrib}
In this section, we formally define a novel privacy notion that we call \emph{{\thething} privacy}.
We apply this privacy notion to time series consisting of \emph{{\thethings}} and regular events, and we design and implement three {\thething} privacy schemes.
We further study {\thething} privacy under temporal correlation that is inherent in time series publishing.


@ -1,50 +1,55 @@
\section{Significant events}
\label{sec:thething}
The privacy mechanisms for the user, $w$-event, and event levels that have already been proposed in the literature assume that, in a time series, any single event, or any sequence of events, or the entire series of events is equally privacy-significant for the users.
In reality, this is an unrealistic assumption that unnecessarily deteriorates the quality of the perturbed data.
The fact that an event is significant can be related to certain user-defined privacy criteria, or to its adjacent events, as well as to the entire time series.
We term significant events as \emph{{\thething} events} or simply \emph{\thethings}.
Identifying {\thethings} in time series can be done in an automatic or manual way.
For example, in spatiotemporal data, \emph{places where an individual spent some time} denote \emph{points of interest} (POIs) (also called stay points)~\cite{zheng2015trajectory}.
Such events, and more particularly their spatial attribute values, can be less privacy-sensitive~\cite{primault2018long}, e.g.,~parks, theaters, etc., or, if individuals frequent them, they can reveal supplementary information, e.g.,~residences (home addresses)~\cite{gambs2010show}, places of worship (religious beliefs)~\cite{franceschi-bicchierairussell2015redditor}, etc.
POIs can be an example of how we can choose {\thethings}, but the idea is not limited to these.
Another example is the detection of privacy-sensitive user interactions by \emph{contact tracing} applications.
This can be practical in disease control~\cite{eames2003contact}, as in the recent outbreak of the Coronavirus disease 2019 (COVID-19) epidemic~\cite{ahmed2020survey}.
Last but not least, {\thethings} in \emph{smart grid} electricity usage patterns may not only reveal the energy consumption of a user but also information regarding activities, e.g.,~`at work', `sleeping', etc., or types of appliances already installed or recently purchased~\cite{khurana2010smart}.
We stress that {\thething} identification is an orthogonal problem to ours, and that we consider {\thethings} given as input to our problem.
We argue that protecting only {\thething} events along with any regular event release, instead of protecting every event in the time series, is sufficient for the user's protection, while it improves data utility.
More specifically, the {\thethings} are, as a whole, adequately protected, while the regular events are perturbed less than they would be under user-level protection.
%In fact, considering {\thething} events can prevent over-perturbing the data in the benefit of their final quality.
Take for example the scenario in Figure~\ref{fig:st-cont}, where {\thethings} are highlighted in gray.
If we want to protect the {\thething} points, we have to allocate at most a budget of $\varepsilon$ to the {\thethings}, while saving some for the release of regular events.
Essentially, the more budget we allocate to an event the less we protect it, but at the same time we maintain its utility.
With {\thething} privacy we propose to distribute the budget taking into account only the existence of the {\thethings} when we release an event of the time series, i.e.,~allocating $\frac{\varepsilon}{5}$ ($4\ \text{\thethings} + 1\ \text{regular point}$) to each event (see Figure~\ref{fig:st-cont}).
This way, we still guarantee\footnote{$\varepsilon$-differential privacy requires that the total allocated budget is at most $\varepsilon$; it does not prescribe how the budget is distributed among the events.} that the {\thethings} are adequately protected, as they receive a total budget of $\frac{4\varepsilon}{5}<\varepsilon$.
At the same time, we avoid over-perturbing the regular events, as we allocate to them a higher total budget ($\frac{4\varepsilon}{5}$) compared to the user-level scenario ($\frac{\varepsilon}{2}$), and thus less noise.
\begin{example}
\label{ex:st-cont}
Figure~\ref{fig:st-cont} shows the case when we want to protect all of Bob's significant events ($p_1$, $p_3$, $p_5$, $p_8$) in his trajectory shown in Figure~\ref{fig:scenario}.
% That is, we have to allocate privacy budget $\varepsilon$ such that at any timestamp $t$ it holds that $\varepsilon_t + \varepsilon_1 + \varepsilon_3 + \varepsilon_5 + \varepsilon_8 \leq \varepsilon$.
In this scenario, event-level protection is not suitable since it can only protect one event at a time.
Hence, we have to apply user-level privacy protection by distributing equal portions of $\varepsilon$ to all the events, i.e.,~$\frac{\varepsilon}{8}$ to each one (the equivalent of applying $8$-event privacy).
In this way, we have protected the {\thething} points; we have allocated a total of $\frac{\varepsilon}{2}<\varepsilon$ to the {\thethings}.
\begin{figure}[htp]
\centering
\includegraphics[width=\linewidth]{problem/st-cont}
\caption{User-level and {\thething} $\varepsilon$-differential privacy protection for the time series of Figure~\ref{fig:scenario}.}
\label{fig:st-cont}
\end{figure}
However, perturbing each regular point by $\frac{\varepsilon}{8}$ deteriorates the data utility unnecessarily.
Notice that the overall privacy budget that we ended up allocating to the user-defined significant events is equal to $\frac{\varepsilon}{2}$ and leaves an equal amount of budget to distribute to any current event.
In other words, uniformly allocating $\frac{\varepsilon}{5}$ to every event would still achieve Bob's privacy goal, i.e.,~protect every significant event, while achieving better utility overall.
\end{example}
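The arithmetic of the example can be reproduced with a few lines of Python; this is a minimal sketch of the allocation rule only, and the function name and interface are illustrative.
\begin{verbatim}
def per_event_budget(epsilon, landmarks):
    # Landmark allocation: the budget is divided among the landmarks
    # plus the current regular event.
    return epsilon / (len(landmarks) + 1)

epsilon, n, landmarks = 1.0, 8, [1, 3, 5, 8]        # Bob's trajectory and landmarks
user_level = epsilon / n                            # epsilon/8 per event
landmark_level = per_event_budget(epsilon, landmarks)   # epsilon/5 per event
assert len(landmarks) * landmark_level <= epsilon   # landmarks jointly stay within epsilon
\end{verbatim}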
\input{problem/thething/contribution}
\input{problem/thething/problem}
\input{problem/thething/solution}


@ -1,7 +1,7 @@
\chapter{Related work}
\label{ch:rel}
Since the domain of data privacy is vast, several surveys have already been published with different scopes.
A group of surveys focuses on specific families of privacy-preserving algorithms and techniques.
Finally, there are some surveys on application-specific privacy challenges.
For example, Zhou et al.~\cite{zhou2008brief} have a focus on social networks, and Christin et al.~\cite{christin2011survey} give an outline of how privacy aspects are addressed in crowdsensing applications.
In this chapter, we document works that deal with privacy under continuous data publishing covering diverse use cases.
We present the works in the literature based on two levels of categorization.
First, we group works with respect to whether they deal with microdata or statistical data (see Section~\ref{subsec:data-categories} for the definitions) as input.
Then, we further group them into two subcategories, depending on whether they are designed for the finite or infinite (see Section~\ref{subsec:data-publishing}) observation setting.
%Such a documentation becomes very useful nowadays, due to the abundance of continuously user-generated data sets that could be analyzed and/or published in a privacy-preserving way, and the quick progress made in this research field.


This calculation is done for each individual that is included in the original data set.
The backward/forward privacy loss at any time point depends on the backward/forward privacy loss at the previous/next instance, the backward/forward temporal correlations, and $\varepsilon$.
The authors propose solutions to bound the temporal privacy loss, under the presence of weak to moderate correlations, in both finite and infinite data publishing scenarios.
In the latter case, they try to find a value for $\varepsilon$ for which the backward and forward privacy loss are equal.
In the former, they similarly try to balance the backward and forward privacy loss while they allocate more $\varepsilon$ at the first and last time points, since they have higher impact on the privacy loss of the next and previous ones.
This way they achieve an overall constant temporal privacy loss throughout the time series.
According to the technique's intuition, stronger correlations result in higher privacy loss.
However, the loss is smaller when the dimension of the transition matrix, which is extracted according to the modeling of the correlations (here it is Markov chain), is larger due to the fact that larger transition matrices tend to be uniform, resulting in weaker data dependence.


@ -1,5 +1,6 @@
\section{Summary}
\label{sec:sum-rel}
In this chapter, we offered a guide that allows its readers to choose the proper privacy-preserving algorithm(s) for their specific use case, by reviewing the literature on privacy in continuous data publishing and grouping the works by the type of their input data (microdata or statistical data) and by their observation setting (finite or infinite).

text/resume.tex Normal file

@ -0,0 +1,25 @@
\chapter{Résumé}
\label{ch:res}
Les capteurs, les appareils portables et les services basés sur la localisation génèrent quotidiennement des quantités massives de données géolocalisées et/ou liées à la localisation et aux utilisateurs.
La manipulation de ces données est utile dans de nombreux domaines d'application, e.g.,~les soins de santé, les bâtiments intelligents, et la surveillance du trafic.
Un pourcentage élevé de ces données contient des informations sur les activités des utilisateurs et d'autres détails personnels, et donc leur manipulation et leur partage soulèvent des inquiétudes quant à la confidentialité des personnes concernées.
Cependant, la manière continue avec laquelle les données sont générées de nos jours et la haute disponibilité de sources d'information externes posent davantage de menaces et ajoutent des défis supplémentaires au problème.
Il est donc essentiel de concevoir des solutions qui non seulement garantissent la protection de la confidentialité, mais offrent également une configurabilité et tiennent compte des préférences des utilisateurs.
Dans cette thèse, nous étudions la littérature concernant la confidentialité des données dans la publication de données en continu, et rapportons les solutions proposées, avec un accent particulier sur les solutions concernant la localisation ou les données géo-référencées.
En fait, une multitude d'algorithmes ont été proposés pour la publication de données préservant la confidentialité, que ce soit pour des microdonnées ou des données statistiques.
Dans ce contexte, nous cherchons à offrir un guide qui permettrait aux lecteurs de choisir en conséquence le ou les algorithmes appropriés pour leur cas d'utilisation spécifique.
Nous donnons un aperçu des propriétés temporelles des algorithmes, e.g.,~s'ils fonctionnent sur des données infinies en temps réel, ou s'ils prennent en considération la dépendance des données existantes.
Après avoir discuté de la littérature sur la publication continue des données, nous continuons à proposer un nouveau type de confidentialité des données, appelé \emph{confidentialité {\thething}}.
Nous soutenons que dans la publication continue de données, les événements ne sont pas tous également importants en termes de confidentialité et, par conséquent, ils devraient affecter différemment le traitement préservant la confidentialité.
La confidentialité différentielle est un paradigme bien établi dans la publication de séries chronologiques préservant la confidentialité.
Différents schémas existent, protégeant soit un seul horodatage, soit toutes les données par utilisateur ou par fenêtre dans la série temporelle, considérant cependant tous les horodatages comme également significatifs.
Le nouveau schéma que nous proposons, confidentialité {\thething}, est basé sur une confidentialité différentielle, mais prend également en compte les événements significatifs (\emph{\thethings}) dans la série chronologique et alloue le budget de confidentialité disponible en conséquence.
Nous concevons trois schémas de confidentialité qui garantissent la confidentialité {\thething} et les étendons davantage afin de fournir une protection de confidentialité plus robuste à l'ensemble {\thething}.
Nous évaluons notre proposition sur des ensembles de données réelles et synthétiques et évaluons l'impact sur l'utilité des données en mettant l'accent sur les situations en présence de corrélation temporelle.
Les résultats de l'évaluation expérimentale et de l'analyse comparative de la confidentialité {\thething} valident son applicabilité à plusieurs scénarios de cas d'utilisation avec et sans la présence de corrélation temporelle.
\paragraph{Mots clés :}
confidentialité des informations, publication continue des données, crowdsensing, traitement des données préservant la confidentialité