\chapter{Related work}
\label{ch:rel}

Since the domain of data privacy is vast, several surveys have already been published with different scopes.
One group of surveys focuses on specific families of privacy-preserving algorithms and techniques.
For instance, Simi et al.~\cite{simi2017extensive} provide an extensive study of works on $k$-anonymity, and Dwork~\cite{dwork2008differential} focuses on differential privacy.
Another group of surveys focuses on techniques that allow the execution of data mining or machine learning tasks under some privacy guarantees, e.g.,~Wang et al.~\cite{wang2009survey}, and Ji et al.~\cite{ji2014differential}.
In a more general scope, Wang et al.~\cite{wang2010privacy} analyze the challenges of privacy-preserving data publishing, and offer a summary and evaluation of relevant techniques.
Additional surveys look into issues around Big Data and user privacy.
Indicatively, Jain et al.~\cite{jain2016big}, and Soria-Comas and Domingo-Ferrer~\cite{soria2016big} examine how Big Data conflict with pre-existing concepts of privacy-preserving data management, and how efficiently $k$-anonymity and $\varepsilon$-differential privacy deal with the characteristics of Big Data.
Others narrow down their research to location privacy issues.
To name a few, Chow and Mokbel~\cite{chow2011trajectory} investigate privacy protection in continuous LBSs and trajectory data publishing, Chatzikokolakis et al.~\cite{chatzikokolakis2017methods} review privacy issues around the usage of LBSs along with relevant protection mechanisms and metrics, Primault et al.~\cite{primault2018long} summarize location privacy threats and privacy-preserving mechanisms, and Fiore et al.~\cite{fiore2019privacy} focus only on the privacy-preserving publishing of trajectory microdata.
Finally, there are some surveys on application-specific privacy challenges.
For example, Zhou et al.~\cite{zhou2008brief} focus on social networks, and Christin et al.~\cite{christin2011survey} outline how privacy aspects are addressed in crowdsensing applications.
Nevertheless, to the best of our knowledge, there is no up-to-date survey that deals with privacy under continuous data publishing while covering diverse use cases.
Such a survey is particularly useful nowadays, due to the abundance of continuously generated user data sets that could be analyzed and/or published in a privacy-preserving way, and to the quick progress made in this research field.

\input{related/micro}
\input{related/statistical}
\input{related/summary}
\section{Microdata}
\label{sec:micro}

As observed in Table~\ref{tab:micro}, privacy-preserving algorithms for microdata rely mostly on $k$-anonymity or derivatives thereof.
Ganta et al.~\cite{ganta2008composition} revealed that $k$-anonymity methods are vulnerable to complementary release attacks (or \emph{composition attacks} in the original publication).
Consequently, the research community proposed solutions based on $k$-anonymity, focusing on different threats linked to continuous publication, as we review later on.
However, notice that only a couple~\cite{li2016hybrid,shmueli2015privacy} of the following works assume that data sets are privacy-protected \emph{independently} of one another, meaning that the publisher is oblivious to the rest of the publications.
On the other hand, algorithms that are based on differential privacy are not concerned with such specific attacks since, by definition, differential privacy considers that the adversary may possess any kind of background knowledge.
Later on, data dependencies were also considered in differential privacy algorithms, to account for the extra privacy loss that they entail.

\includetable{table-micro}


\subsection{Finite observation}
\label{subsec:micro-finite}

% Anonymizing sequential releases
% - microdata
% - finite (sequential)
% - batch
% - complementary release (form the quasi-identifiers from joining releases)
% - user
% - k-anonymity
% - generalization + specialization
\hypertarget{wang2006anonymizing}{Wang and Fung}~\cite{wang2006anonymizing} address the problem of anonymously releasing different projections (i.e.,~subsets of the attributes) of the same data set in subsequent timestamps.
More precisely, the authors want to protect individual information that could be revealed by joining various releases of the same data set.
To do so, instead of locating the quasi-identifiers in a single release, the authors suggest that the quasi-identifiers may span the current and all previous releases of the (projections of the) data set.
Then, the proposed method uses the join of the different releases on the common identifying attributes.
The goal is to generalize the identifying attributes of the current release, given that previous releases are immutable.
The generalization is performed in a top-down manner, meaning that the attributes are initially over-generalized, and are then specialized step by step until predefined quality and privacy requirements are met.
The privacy requirement is the so-called \emph{($X$, $Y$)-privacy} for a threshold $k$, meaning that the identifying attributes in $X$ are linked with at most $k$ sensitive values in $Y$, in the join of the previously released and current data sets.
The quality requirement is tunable within the framework.
Namely, the authors propose three alternatives: the reduction of the class entropy~\cite{quinlan2014c4, shannon2001mathematical}, the notion of distortion, and the discernibility~\cite{bayardo2005data}.
The anonymization algorithm for releasing a data set in the existence of a previously released data set takes into account the scalability and performance problems that a join between the two may entail.
Still, when many previous releases exist, the complexity remains high.

% Anonymity for continuous data publishing
% - microdata
% - finite (incremental)
% - batch
% - complementary release (tuple correspondence attack)
% - user
% - k-anonymity
% - generalization + specialization
\hypertarget{fung2008anonymity}{Fung et al.}~\cite{fung2008anonymity} introduce the problem of privately releasing continuous incremental data sets.
As a reminder, the invariant of this kind of release is that at every timestamp $t_i$, the records previously released at $t_j$ ($j < i$) are released again together with a set of new records.
The authors first focus on two consecutive releases and describe three classes of possible attacks, which fall under the general category of complementary release attacks.
They name these attacks \emph{correspondence attacks} because they rely on the principle that all tuples from an original data set $D_1$, from timestamp $t_1$, correspond to a tuple in the data set $D_2$, from timestamp $t_2$.
Naturally, the opposite does not hold, as tuples added at $t_2$ do not exist in $D_1$.
Assuming that the attacker knows the quasi-identifiers and the timestamp of the record of a person, they define the \emph{backward}, \emph{cross}, and \emph{forward} (\emph{BCF}) attacks.
They show that combining two individually $k$-anonymized subsequent releases using one of the aforementioned attacks can lead to `cracking' some of the records in the set of $k$ candidate tuples, rendering the privacy level lower than $k$.
Besides detecting cases where BCF anonymity is compromised between two releases, the authors also provide an anonymization algorithm for a release $\pmb{o}_2$ in the presence of a private release $\pmb{o}_1$.
The algorithm starts from the most generalized state possible for the quasi-identifiers of the records in $D_2$.
Step by step, it checks which combinations of specializations on the attributes do not violate BCF anonymity, and outputs the most specialized version of the data set possible.
The authors discuss how the framework extends to multiple releases and to different kinds of privacy methods (other than $k$-anonymity).
It is worth noting that, to maintain a certain quality for a release, it is essential that the delta among subsequent releases is large enough; otherwise, the needed generalization level may destroy the utility of the data set.

% K anonymity for trajectories with spatial distortion
% - microdata
% - finite (sequential)(trajectories)
% - batch
% - complementary release
% - user
% - clustering & k-anonymity
% - distortion (on the centroid)
\hypertarget{abul2008never}{Abul et al.}~\cite{abul2008never} defined \emph{($k$, $\delta$)-anonymity} for enabling the high-quality publishing of moving-objects data sets.
The authors claim that the classical $k$-anonymity framework cannot be directly applied to this kind of data from a data-centric perspective.
The traditional distortion techniques in $k$-anonymity, i.e.,~generalization and suppression, yield a great loss of information.
On the one hand, suppression diminishes the size of the database.
On the other hand, generalization demands the existence of quasi-identifiers, the values of which are going to be generalized.
In trajectories, however, all points can equally be considered quasi-identifiers.
Obviously, generalizing all the trajectory points would yield great levels of distortion.
For this reason, a new, spatial-based distortion method is proposed.
After clustering the trajectories in groups of at least $k$ elements, each trajectory is translated into a new one that lies within a predefined distance threshold $\delta$ of the original.
Of course, the newly generated trajectories should still form a $k$-anonymous set.
The authors validate their theory by experimentally showing that the difference in the results of count queries executed over a data set and its ($k$, $\delta$)-anonymous version remains low.
However, a comparative evaluation against existing clustering techniques, e.g.,~$k$-means, would have been interesting, to better support the contributions of this part of the solution as well.

% Privacy-utility trade-off under continual observation
% - microdata
% - finite
% - batch/streaming
% - dependence
% - user
% - perturbation (randomization)
% - temporal correlations (HMM)
% - local
\hypertarget{erdogdu2015privacy}{Erdogdu and Fawaz}~\cite{erdogdu2015privacy} consider the scenario where privacy-conscious individuals separate the data that they generate into sensitive and non-sensitive.
The individuals keep the former unreleased, and publish samples of the latter to a service provider.
Privacy mapping, implemented as a stochastic process, distorts the non-sensitive data samples locally, and a separable distortion metric (e.g.,~the Hamming distance) calculates the discrepancy of the distorted data from the original.
The goal of the privacy mapping is to find a balance between the distortion and privacy metrics, i.e.,~to achieve maximum utility for the released data, while offering sufficient privacy guarantees.
The authors assume that there is a data dependence (modeled with a hidden Markov model (HMM)) between the two data sets, and thus the release of the distorted data set can reveal information about the sensitive one.
They investigate both a simple and a complex attack setting.
In the simple attack, the adversary makes static inferences, based only on the observations made so far, which cannot be altered later.
In the complex attack, past and future data releases dynamically affect the inferences that an adversarial entity makes.
In both cases, the framework quantifies the information leakage at any time point using a privacy metric that measures the improvement of the adversarial inference of the sensitive data set, which the individual keeps secret, after observing the data released at that particular point.
Throughout the process, the authors consider both the batch and the streaming processing schemes.
However, the assumption that individuals are privacy-conscious can drastically limit the applicability of the framework.
Furthermore, the metrics that the framework utilizes for the evaluation of the privacy guarantees that it provides are not intuitive.

% M-invariance: towards privacy preserving re-publication of dynamic data sets
% - microdata
% - finite
% - batch
% - complementary release (intersection of sensitive values)
% - user
% - k-anonymity
% - generalization + synthetic data insertion
\hypertarget{xiao2007m}{Xiao et al.}~\cite{xiao2007m} consider the case when a data set is (re)published at different timestamps in an update (insert/delete tuple) manner.
More precisely, they address data anonymization in continuous publishing by implementing $m$-\emph{invariance}.
In a simple $k$-anonymity (or $l$-diversity) scenario, the privacy of an individual present in two updates can be compromised by intersecting the corresponding sets of sensitive values.
In contrast, an individual who exists in a series of $m$-invariant releases is always associated with the same set of $m$ different sensitive values.
To enable the publishing of $m$-invariant data sets, artificial tuples (\emph{counterfeits}) may be added in a release.
To minimize the noise added to the data sets, the authors provide an algorithm with two extra desiderata: limiting the counterfeits, and minimizing the quasi-identifiers' generalization level.
Still, the choice of adding tuples with specific sensitive values disturbs the value distribution, with a direct effect on any relevant statistical analysis.

% Preventing equivalence attacks in updated, anonymized data
% - microdata
% - finite
% - batch
% - complementary release (equivalence attack)
% - user
% - m-invariance (k-anonymity)
% - generalization + synthetic data insertion
In the same update setting (insert/delete tuple), \hypertarget{he2011preventing}{He et al.}~\cite{he2011preventing} introduce another kind of attack, namely the \emph{equivalence} attack, which is not taken into account by the aforementioned $m$-invariance technique.
The equivalence attack allows sets of individuals to be considered equivalent, as far as the sensitive attribute is concerned, across different timestamps.
In this way, all the members of an equivalence class are harmed if the sensitive value is learned for even one member.
For a number of releases to be private, they have to be both $m$-invariant and $e$-equivalent ($e < m$).
The authors propose an algorithm incorporating $m$-invariance, based on the \emph{min-cut} graph optimization problem, for publishing $e$-equivalent data sets.
The proposed method can achieve better levels of privacy, at comparable runtime and quality to $m$-invariance.

% Privacy by diversity in sequential releases of databases
% - microdata
% - finite (sequential)
% - batch
% - complementary release (unknown previous releases)
% - user
% - l-diversity
% - generalization + permutation of sensitive information among tuples with the same quasi-identifiers
\hypertarget{Shmueli}{Shmueli and Tassa}~\cite{shmueli2015privacy} identified the computational inefficiency of anonymously releasing a data set while taking previous releases into account, in scenarios of continuous data publishing.
The released data sets contain subsets of the attributes of an original data set, while the authors propose an extension for attribute addition.
Their algorithm can compute $l$-diverse anonymized releases (over different subsets of attributes) in parallel, by generating $l - 1$ so-called \emph{fake} worlds.
A fake world is generated from the base data set by randomly permuting non-identifier and sensitive values among the tuples, in such a way that minimal information loss (the quality desideratum) is incurred.
This is partially accomplished by verifying that values are permuted among tuples with similar quasi-identifiers.
Then, the algorithm creates buckets of tuples with at least $l$ different sensitive values, in which the quasi-identifiers will then be generalized in order to achieve $l$-diversity (the privacy protection desideratum).
The generalization step is also conducted in an information-loss-efficient way.
All the different releases will be $l$-diverse because they are created assuming the same possible worlds, with which they are consistent.
Tuple/attribute deletion is briefly discussed and left as an open question.
The article is contrasted with a previous work~\cite{shmueli2012limiting} of the same authors, claiming that the new approach considers a stronger adversary (one who knows all the individuals with their quasi-identifiers in the data set, and not only one), and that the computation is much more efficient, as its complexity is not exponential in the number of previous publications.

% A hybrid approach to prevent composition attacks for independent data releases
% - microdata
% - finite
% - batch
% - complementary release (releases unknown to the publisher)
% - user
% - k-anonymity
% - generalization + noise (from normal distribution)
\hypertarget{li2016hybrid}{Li et al.}~\cite{li2016hybrid} identified a common characteristic in most of the privacy techniques: when anonymizing a data set, all previous releases are known to the data publisher.
However, it is probable that the releases are independent of each other, and that the data publisher is unaware of these releases when anonymizing the data set.
In such a setting, the previous techniques would suffer from composition attacks.
The authors define this kind of adversary and propose a hybrid model for data anonymization.
More precisely, the publisher/adversary knows that an individual exists in two different anonymized versions of the same data set, and possesses both anonymized versions, but the anonymization is done independently (i.e.,~without considering the previously anonymized data sets) for each data set.
The key idea in fighting a composition attack is to increase the probability that the matches among tuples from two data sets are random, i.e.,~that they link different individuals rather than the same one.
To do so, the proposed privacy protection method applies three preprocessing steps before a traditional $k$-anonymity or $l$-diversity algorithm.
First, the data set is sampled so as to blur the knowledge of the existence of individuals.
Then, especially in small data sets, the quasi-identifiers are distorted by noise addition before the classical generalization step.
The noise is drawn from a normal distribution, with the mean and standard deviation values calculated on the corresponding quasi-identifier values.
In the case of sparse data, the sensitive values are generalized along with the quasi-identifiers.
The danger of composition attacks is less prominent when using this method on top of $k$-anonymity than without it, while the quality results are comparable.
The authors also provide a comparison to a data set release using $\varepsilon$-differential privacy, demonstrating that their technique is superior with respect to quality, because in the competing algorithm the noise adds up for each sensitive attribute to be protected.
Even though the authors use two different values for $\varepsilon$ in the experiments, a better experiment would have been to compare the quality/privacy ratio between the two methods.
This is a good attempt to independently anonymize multiple times the same data set; nevertheless, the scenario is restricted to releases over the same database schema, using the same perturbation and generalization functions.

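To illustrate the noise-addition preprocessing step, the following minimal Python sketch distorts numerical quasi-identifiers with noise drawn from a normal distribution parameterized by each column's empirical mean and standard deviation, as described above; the data set, the column names, and the exact parameterization are hypothetical, and this is not the authors' implementation.

\begin{verbatim}
import numpy as np

def perturb_quasi_identifiers(records, qi_columns, rng=None):
    """Distort numerical quasi-identifier columns with Gaussian noise
    whose parameters are estimated from the column itself (one plausible
    reading of the scheme); records maps column names to numpy arrays."""
    rng = rng or np.random.default_rng()
    noisy = dict(records)
    for col in qi_columns:
        values = records[col].astype(float)
        mu, sigma = values.mean(), values.std()
        # Per-record noise from N(mu, sigma), computed on the column;
        # a zero-centered variant would use rng.normal(0, sigma) instead.
        noisy[col] = values + rng.normal(mu, sigma, size=values.shape)
    return noisy

# Hypothetical toy data set: age and 3-digit zip code as quasi-identifiers.
data = {"age": np.array([34, 45, 29, 52]),
        "zip3": np.array([751, 752, 750, 753])}
print(perturb_quasi_identifiers(data, ["age", "zip3"]))
\end{verbatim}
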
% Publishing trajectories with differential privacy guarantees
% - microdata (trajectories)
% - finite
% - batch
% - linkage
% - event
% - differential privacy
% - perturbation (Laplace)
% - Seems to belong to the local scheme but in the scenario/evaluation they release multiple trajectories.
\hypertarget{jiang2013publishing}{Jiang et al.}~\cite{jiang2013publishing} focus on ship trajectories with known starting and terminal points.
More specifically, they study different noise addition mechanisms for publishing trajectories with differential privacy guarantees.
These mechanisms include adding global noise to the whole trajectory, and local noise to either each location point or the coordinates of each point of the trajectory.
The first two mechanisms sample a noisy radius from an exponential distribution, while the third adds noise drawn from a Laplace distribution to each coordinate of every location.
By comparing these different techniques, they conclude that the latter offers a better privacy guarantee and a smaller error bound.
Nonetheless, the resulting trajectory is noticeably distorted, due to the addition of Laplace noise to the original coordinates.
To tackle this issue, they design the \emph{Sampling Distance and Direction} (SDD) mechanism.
This mechanism allows the publishing of the optimal next possible trajectory point by sampling, from the probability distribution of the exponential mechanism, a suitable distance and direction at the current position, while taking into account the ship's maximum speed constraint.
Because SDD utilizes the exponential mechanism, it outperforms the other three mechanisms, and maintains a good utility-privacy balance.

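As a minimal sketch of the simplest of these baselines, the per-coordinate Laplace mechanism can be written in a few lines of Python; the sample trajectory, the sensitivity value, and the budget are hypothetical.

\begin{verbatim}
import numpy as np

def laplace_perturb_trajectory(points, sensitivity, epsilon, rng=None):
    """Add independent Laplace noise to each coordinate of every point
    (a sketch of the per-coordinate baseline, not the SDD mechanism)."""
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon  # standard Laplace-mechanism scale
    noisy = np.asarray(points, dtype=float)
    return noisy + rng.laplace(loc=0.0, scale=scale, size=noisy.shape)

# Hypothetical trajectory of (x, y) positions, assumed sensitivity 1.0.
trajectory = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.5)]
print(laplace_perturb_trajectory(trajectory, sensitivity=1.0, epsilon=0.5))
\end{verbatim}
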
% Differentially private trajectory data publication
% - microdata (trajectories)
% - finite
% - batch
% - linkage
% - user
% - differential privacy
% - perturbation (Laplace)
\hypertarget{chen2011differentially}{Chen et al.}~\cite{chen2011differentially} propose a non-interactive, data-dependent, privacy-preserving algorithm to generate a differentially private release of trajectory data.
The algorithm relies on a noisy prefix tree, i.e.,~an ordered search tree data structure used to store an associative array.
Each node represents a location of a trajectory, from a set of possible locations that any user can be present at, and contains a perturbed count, representing the number of individuals at that location, with noise drawn from a Laplace distribution.
The privacy budget is equally allocated to each level of the tree, where each level represents a timestamp.
At each level, and for every node, the algorithm looks for the children nodes with a non-zero number of trajectories (non-empty nodes) and continues expanding them.
An empty node has a noisy count lower than a threshold that depends on the available privacy budget and the height of the tree.
All children nodes are associated with disjoint data subsets, and thus the algorithm can utilize for every node all of the available budget at every tree level, according to the parallel composition theorem of differential privacy.
To generate the anonymized database, it is necessary to traverse the prefix tree once in post-order, paying attention to terminating (empty) nodes.
During this process, taking into account some consistency constraints helps to avoid erroneous trajectories due to the noise injection.
Namely, each node of a path should have a count that is greater than or equal to the counts of its children, and each node of a path should have a count that is greater than the sum of the counts of all of its children.
Increasing the privacy budget results in a smaller average relative error, because less noise is added at each level, and thus improves quality.
By increasing the height of the tree, the relative error initially decreases, as more information is retained from the database.
However, after a certain threshold, increasing the height results in less available privacy budget at each level, and thus in a larger relative error due to the increased perturbation.

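The following Python sketch illustrates the core noisy-prefix-tree construction under simplifying assumptions: a fixed location alphabet, an equal budget share per level, and a hypothetical pruning threshold; it is an illustration of the idea, not the authors' implementation.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def build_noisy_prefix_tree(trajectories, locations, height, eps, theta):
    """Recursively expand prefixes whose noisy count exceeds theta; each
    level consumes an equal share eps/height of the budget (parallel
    composition applies across sibling subtrees at the same level)."""
    eps_level = eps / height

    def expand(prefix, subset, level):
        node = {"prefix": prefix,
                "noisy_count": len(subset) + rng.laplace(0, 1 / eps_level),
                "children": []}
        if level < height and node["noisy_count"] >= theta:
            for loc in locations:
                child_subset = [t for t in subset
                                if len(t) > level and t[level] == loc]
                child = expand(prefix + (loc,), child_subset, level + 1)
                if child["noisy_count"] >= theta:  # keep non-empty nodes
                    node["children"].append(child)
        return node

    return expand((), trajectories, 0)

# Hypothetical trajectories over a three-location alphabet.
data = [("A", "B"), ("A", "B"), ("A", "C"), ("B", "C")]
tree = build_noisy_prefix_tree(data, ["A", "B", "C"],
                               height=2, eps=1.0, theta=0.5)
\end{verbatim}
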
% Protecting Locations with Differential Privacy under Temporal Correlations
% - microdata (trajectories)
% - finite
% - batch
% - dependence
% - user?
% - $\delta$-location set (differential privacy)
% - perturbation (Laplace / Planar Isotropic Mechanism (PIM))
% - temporal correlations (Markov)
% - local
\hypertarget{xiao2015protecting}{Xiao et al.}~\cite{xiao2015protecting} propose another privacy definition based on differential privacy that accounts for temporal correlations in geo-tagged data.
Location transitions between two consecutive timestamps are determined by temporal correlations modeled through a Markov chain.
A \emph{$\delta$-location set} includes all the probable locations a user might appear at, excluding locations of low probability.
Therefore, the true location is hidden in the resulting set, in which any pair of locations are indistinguishable.
The lower the value of $\delta$, the more locations are included, and hence the higher the achieved level of privacy.
The authors use the \emph{Planar Isotropic Mechanism} (PIM) as the perturbation mechanism, which they designed upon their proof that the $l_1$-norm sensitivity fails to capture the exact sensitivity in a multidimensional space.
For this reason, PIM utilizes instead the \emph{sensitivity hull}, a notion independent of the context of location privacy.
In~\cite{xiao2017loclok}, the authors demonstrate the functionality of their system \emph{LocLok}, which implements the concept of the $\delta$-location set.

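In symbols (notation ours, consistent with the description above), if $\Pr[u_t = s_i]$ denotes the prior probability, derived from the Markov model, that the user is at location $s_i$ at timestamp $t$, the $\delta$-location set can be written as the minimal set of the most probable locations covering all but $\delta$ of the prior probability mass:
\[
\Delta X_t = \min \left\{ X_t \subseteq \mathcal{S} \;\middle|\; \sum_{s_i \in X_t} \Pr[u_t = s_i] \geq 1 - \delta \right\},
\]
where $\mathcal{S}$ is the set of all possible locations.
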
% Time distortion anonymization for the publication of mobility data with high utility
% - microdata (trajectory)
% - finite
% - batch
% - linkage
% - event
% - temporal transformation
% - perturbation
% - local
\hypertarget{primault2015time}{Primault et al.}~\cite{primault2015time} proposed \emph{Promesse}, an algorithm that builds on time distortion instead of location distortion when releasing trajectories.
Promesse takes as input an individual's mobility trace, comprising pairs of geolocations and timestamps, and a parameter $\varepsilon$.
The latter indicates the desired distance between the location points that will be publicly released.
Initially, Promesse extracts regularly spaced locations, interpolating each location at a distance that depends on the previous location and the value of $\varepsilon$.
Then, it removes the first and last locations of the mobility trace, and assigns uniformly distributed timestamps to the remaining locations of the trajectory.
Hence, the resulting trace has a smooth speed, and therefore places where the individual stayed longer, e.g.,~home, work, etc., are indistinguishable.
The algorithm needs to know the starting and ending points of the trajectory; thus, it can only apply to offline scenarios.
Furthermore, it works better with fine-grained data sets, because in this way it can achieve optimal geolocation and timestamp pairing.
Moreover, the definition of $\varepsilon$ cannot provide versatile privacy protection, since it is data dependent.

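A minimal Python sketch of this time-distortion idea follows: keep locations spaced roughly $\varepsilon$ apart along the trace, drop the endpoints, and assign uniformly spaced timestamps; the trace format and step value are hypothetical, and the spacing here is approximate rather than exactly interpolated.

\begin{verbatim}
import math

def promesse_like(trace, eps):
    """Sketch of time distortion: retain locations ~eps apart, drop the
    first/last points, and re-assign uniformly spaced timestamps."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    # 1. Extract (approximately) regularly spaced locations.
    spaced = [trace[0][:2]]
    for x, y, _ in trace[1:]:
        if dist((x, y), spaced[-1]) >= eps:
            spaced.append((x, y))

    # 2. Remove the first and last locations (trip endpoints).
    spaced = spaced[1:-1]
    if not spaced:
        return []

    # 3. Assign uniformly distributed timestamps over the trace duration,
    #    which smooths the apparent speed and hides long stays.
    t0, t1 = trace[0][2], trace[-1][2]
    step = (t1 - t0) / (len(spaced) + 1)
    return [(x, y, t0 + (i + 1) * step)
            for i, (x, y) in enumerate(spaced)]

# Hypothetical trace of (x, y, timestamp) triples, eps = 1.0.
trace = [(0, 0, 0), (0.4, 0, 10), (1.2, 0, 200), (2.5, 0, 210), (4, 0, 220)]
print(promesse_like(trace, eps=1.0))
\end{verbatim}
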
% Differentially Private and Utility Preserving Publication of Trajectory Data
% - microdata (trajectory)
% - finite
% - batch
% - linkage
% - user
% - differential privacy
% - perturbation (Laplace)
% - global
\hypertarget{gursoy2018differentially}{Gursoy et al.}~\cite{gursoy2018differentially} designed \emph{DP-Star}, a differential privacy framework that publishes synthetic trajectories featuring similar statistics to the original ones.
By utilizing the \emph{Minimum Description Length} (MDL) principle~\cite{grunwald2007minimum}, DP-Star eliminates redundant data points in the original trajectories, and generates trajectories containing only representative points.
In this way, the available privacy budget needs to be allocated to far fewer data points, striking a balance between preciseness and conciseness.
Moreover, the algorithm constructs a density-aware grid, whose granularity adapts to the geographical density of the trajectory points of the data set, and preserves the spatial density despite any necessary perturbation.
Then, DP-Star preserves the dependence between the trajectories' start and end points by extracting (through a first-order Markov mobility model) the trip distribution, and the intra-trajectory mobility.
Finally, a \emph{Median Length Estimation} (MLE) mechanism approximates the trajectories' lengths, and the framework generates privacy- and utility-preserving synthetic trajectories.
Every phase of the process consumes some predefined privacy budget, keeping the respective products of each phase private and eligible for publishing.
The authors compare their design with that of~\cite{chen2012differentially} and~\cite{he2015dpt} by running several tests, and ascertain that it outperforms them in terms of data utility.
However, due to DP-Star's privacy budget distribution over its different phases, for small values of $\varepsilon$ the framework's privacy performance is inferior to that of its competitors.


\subsection{Infinite observation}
\label{subsec:micro-infinite}

% Continuous privacy preserving publishing of data streams
% - microdata
% - infinite
% - stream
% - as k-anonymity
% - event
% - k-anonymity
% - generalization
\hypertarget{zhou2009continuous}{Zhou et al.}~\cite{zhou2009continuous} introduce the problem of infinite private data publishing, and propose a randomized solution based on $k$-anonymity.
More precisely, they continuously publish equivalence classes of size greater than or equal to $k$, containing generalized tuples from distinct persons (or identifiers in general).
To create the equivalence classes, they set several desiderata.
Besides the size of a class, which should be greater than or equal to $k$, the information loss incurred by the generalization should be minimal, whereas the delay in forming and publishing the class should be kept low as well.
To achieve these requirements, they built a randomized model using the popular structure of $R$-trees, extended to accommodate information on the data density distribution.
In this way, they achieve a better quality/publishing-delay ratio for the released private data.
On the one hand, the formed classes contain data items that are close to each other (in dense areas); on the other hand, classes with tuples from sparse areas are released as soon as possible so that the delay remains low.

% Maskit: Privately releasing user context streams for personalized mobile applications
% - microdata (context)
% - infinite
% - streaming
% - dependence
% - event
% - $\delta$-privacy
% - suppression
% - temporal (Markov)
% - local
\hypertarget{gotz2012maskit}{Gotz et al.}~\cite{gotz2012maskit} developed \emph{MaskIt}, a system that interfaces the sensors of a personal device, identifies various sets of contexts, and releases a stream of privacy-preserving contexts to untrusted applications installed on the device.
A context represents the circumstances that form the setting for an event, e.g.,~`at the office', `running', etc.
The individuals have to define the sensitive contexts that they wish to be protected, and the desired level of privacy.
The system models the individuals' various contexts, and the transitions between them.
It captures temporal correlations, and models individuals' movement in space using Markov chains, while taking into account historical observations.
After the initialization, MaskIt filters a stream of the individual's contexts, checking for each context whether it is safe to release it or whether it is necessary to suppress it.
The authors define \emph{$\delta$-privacy} as the privacy model of MaskIt.
More specifically, a system preserves $\delta$-privacy if the difference between the posterior and prior knowledge of an adversary, after observing an output at any possible timestamp, is bounded by $\delta$.
After filtering all the elements of an input stream, MaskIt releases an output sequence for a single day.
The system can repeat the process to publish longer context streams.
The expected number of released contexts quantifies the utility of the system.

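In symbols (notation ours, consistent with the description above), a system that releases an output sequence $\pmb{o}$ preserves $\delta$-privacy if, for every sensitive context $s$ and every timestamp $t$,
\[
\Pr[x_t = s \mid \pmb{o}] - \Pr[x_t = s] \leq \delta ,
\]
where $x_t$ is the individual's context at time $t$, the first term is the adversary's posterior belief after observing the released sequence, and the second term is the prior derived from the Markov model.
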
% PLP: Protecting location privacy against correlation analyze Attack in crowdsensing
% - microdata (context, location)
% - infinite
% - streaming
% - dependence
% - event
% - $\delta$-privacy
% - suppression
% - spatiotemporal (CRF)
% - local
\hypertarget{ma2017plp}{Ma et al.}~\cite{ma2017plp} propose \emph{PLP} (Protecting Location Privacy), a crowdsensing scheme that protects location privacy against adversaries that can extract spatiotemporal correlations from crowdsensing data.
PLP filters an individual's context (location, sensing data) stream, while taking into consideration long-range dependencies among locations and reported sensing data, which are modeled by conditional random fields (CRFs).
It suppresses sensing data at all sensitive locations, while data at non-sensitive locations are reported with a certain probability defined by observing the corresponding CRF model.
On the one hand, the scheme estimates the privacy of the reported data by the difference $\delta$ between the probability that an individual is at a specific location given the supplementary information, and the same probability without the extra information.
On the other hand, it quantifies the utility by measuring the total amount of reported data (more is better).
An estimation algorithm searches for the optimal strategy that maximizes utility while preserving a predefined privacy threshold.

% An adaptive geo-indistinguishability mechanism for continuous LBS queries
% - microdata
% - infinite/finite (not clear)
% - streaming
% - dependence
% - event
% - geo-indistinguishability
% - perturbation (planar Laplace)
% - local
\hypertarget{al2018adaptive}{Al-Dhubhani and Cazalas}~\cite{al2018adaptive} propose an adaptive privacy-preserving technique based on geo-indistinguishability, which adjusts the amount of noise required to obfuscate an individual's location based on its correlation level with the previously published locations.
Before adding noise, the technique evaluates the adversary's ability to estimate an individual's position.
This process utilizes a regression algorithm over a certain prediction window, exploiting previous location releases.
More concretely, in areas where locations present strong correlations, an adversary can predict the current location with a low estimation error.
Consequently, it is necessary to add more noise to the locations prior to their release.
Adapting the amount of injected noise depending on the data correlation level might lead to a better performance, in terms of both privacy and utility, in the short term.
However, alternating the amount of injected noise at each timestamp, without ensuring the preservation of the features (including correlations) present in the original data, might lead to arbitrary utility loss.

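As a reminder of the underlying guarantee (in the standard formulation of geo-indistinguishability, on which the above technique builds), a mechanism $K$ satisfies $\varepsilon$-geo-indistinguishability if, for any two locations $x, x'$ and any set of reported values $Z$,
\[
\Pr[K(x) \in Z] \leq e^{\varepsilon d(x, x')} \Pr[K(x') \in Z],
\]
where $d(\cdot, \cdot)$ is the Euclidean distance; adapting the amount of injected noise per timestamp effectively corresponds to tuning $\varepsilon$ according to the estimated correlation level.
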
% Preventing velocity-based linkage attacks in location-aware applications
% - microdata (trajectory)
% - infinite
% - streaming
% - dependence (velocity)
% - event
% - temporal and spatial cloaking
% - local and global
\hypertarget{ghinita2009preventing}{Ghinita et al.}~\cite{ghinita2009preventing} tackle attacks on location privacy that arise from the linkage of maximum velocity with cloaked regions when using an LBS.
The authors propose methods that can prevent the disclosure of the exact location coordinates of an individual, and bound the probability of associating an individual with a sensitive location-related feature.
The first method is based on temporal cloaking, and utilizes deferral and postdating.
Deferral delays the disclosure of a cloaked region that would be impossible for an individual to have reached, based on the latest region that she published and her known maximum speed.
Postdating reports the nearest previous cloaked region that will allow the LBS to return relevant results with high probability, since the two regions are close.
The second method implements spatial cloaking.
First, it creates cloaked regions by taking into account all of the user-specified sensitive features that are relevant to the current location (filtering of features).
Then, it enlarges the area of the region to satisfy the privacy requirements (cloaking).
Finally, it defers the publishing of the region until it includes the current timestamp (safety enforcement), similar to temporal cloaking.
The system measures the quality of service of both methods in terms of the cloaked region size, the time and space error, and the failure ratio.
The cloaked region size is important because larger regions may decrease the utility of the information that the LBS returns.
Time and space errors are possible due to delayed location reporting and region cloaking.
The failure ratio corresponds to the percentage of dropped queries, in cases where it is impossible to satisfy the privacy requirements.
Although both methods experimentally prove to offer an adequate quality of service, the privacy requirements and metrics that the authors consider do not offer substantial privacy guarantees for commercial applications.

% A Trajectory Privacy-Preserving Algorithm Based on Road Networks in Continuous Location-based Services
% - microdata (trajectory)
% - infinite
% - streaming
% - linkage
% - event
% - $l$-diversity
% - generalization (cloaking)
% - LBS but global
\hypertarget{ye2017trajectory}{Ye et al.}~\cite{ye2017trajectory} present an $l$-diversity method for producing a cloaked area, based on the local road network, for protecting trajectories.
A trusted entity divides the spatial region of interest based on the density of the road network, using quadtree structures, until every subregion contains at least $l$ road segments.
Then, it creates a database for each subregion by generating all the possible trajectories based on real road network information.
The trusted entity uses this database, when individuals attempt to interact with an LBS by sending their current location, to predict their next locations.
Thereafter, it selects the $l - 1$ trajectories nearest to the individual's current location, and constructs a minimum cloaking region.
The resulting cloaking area covers the $l$ nearest trajectories, and ensures a minimum area of coverage.
This method addresses the limitations of $k$-anonymity in terms of the continuous publishing of trajectory data.
The required calculation of every possible trajectory, for the construction of a trajectory database for every subregion, might require an arbitrary amount of computation depending on the area's features.
Nonetheless, the utilization of quadtrees can limit the overhead of the searching process.

% Quantifying Differential Privacy under Temporal Correlations
% - statistical
% - infinite/finite
% - streaming
% - dependence
% - mainly (w-)event but also user
% - differential privacy
% - perturbation (Laplace)
% - temporal correlations (Markov)
\hypertarget{cao2017quantifying}{Cao et al.}~\cite{cao2017quantifying,cao2018quantifying} propose a method for computing the temporal privacy loss of a differential privacy mechanism in the presence of temporal correlations and background knowledge.
The goal of their technique is to guarantee privacy protection and to bound the privacy loss at every time point, under the assumption of independent data releases.
It calculates the temporal privacy loss as the sum of the backward and forward privacy loss, minus the default privacy loss $\varepsilon$ of the mechanism (because it is counted twice in the aforementioned quantities).
This calculation is done for each individual that is included in the original data set, and the overall temporal privacy loss is equal to the maximum calculated value at every time point.
The backward/forward privacy loss at any time point depends on the backward/forward privacy loss at the previous/next instance, the backward/forward temporal correlations, and $\varepsilon$.
The authors propose solutions to bound the temporal privacy loss, under the presence of weak to moderate correlations, in both finite and infinite data publishing scenarios.
In the latter case, they try to find a value for $\varepsilon$ for which the backward and forward privacy loss are equal.
In the former, they similarly try to balance the backward and forward privacy loss, while allocating more $\varepsilon$ to the first and last time points, since these have a higher impact on the privacy loss of the next and previous ones.
In this way, they achieve an overall constant temporal privacy loss throughout the time series.
According to the technique's intuition, stronger correlations result in higher privacy loss.
However, the loss is smaller when the dimension of the transition matrix, which is extracted according to the modeling of the correlations (here, a Markov chain), is larger, due to the fact that larger transition matrices tend to be uniform, resulting in weaker data dependence.
The authors briefly investigate all of the possible privacy levels; however, the solutions that they propose are suitable only for event-level privacy.
Last but not least, the technique requires the calculation of the temporal privacy loss for every individual within the data set, which might prove computationally inefficient in real-time scenarios.

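In symbols (notation ours), if $\alpha^B_t$ and $\alpha^F_t$ denote the backward and forward privacy loss at time point $t$, the temporal privacy loss of an $\varepsilon$-differentially private mechanism is
\[
\alpha_t = \alpha^B_t + \alpha^F_t - \varepsilon ,
\]
since the default privacy loss $\varepsilon$ is included in both terms.
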
\section{Statistical data}
\label{sec:statistical}

When continuously publishing statistical data, usually in the form of counts, the most widely used privacy method is differential privacy, or derivatives of it, as witnessed in Table~\ref{tab:statistical}.
In theory, differential privacy makes no assumptions about the background knowledge available to the adversary.
In practice, as we observe in Table~\ref{tab:statistical}, data dependencies (e.g.,~correlations) arising in the continuous publication setting are frequently (although not as a rule) considered as attacks in the proposed algorithms.

\includetable{table-statistical}


\subsection{Finite observation}
\label{subsec:statistical-finite}

% Practical differential privacy via grouping and smoothing
% - statistical (counts)
% their scenario is built on location data (check-ins)
% - finite
% - batch
% - linkage
% - event
% - differential privacy
% - perturbation (Laplace)
\hypertarget{kellaris2013practical}{Kellaris et al.}~\cite{kellaris2013practical} pointed out that in time series, where users might contribute to an arbitrary number of aggregates, the sensitivity of the query answering function is significantly influenced by their presence/absence in the data set.
Thus, the Laplace perturbation algorithm, commonly used with differential privacy, may produce meaningless data sets.
Furthermore, under such settings, the discrete Fourier transformation of the Fourier perturbation algorithm (another popular technique for data perturbation) may behave erratically, and affect the utility of the outcome of the mechanism.
For this reason, the authors proposed their own method, involving grouping and smoothing, for the one-time publishing of time series of non-overlapping counts, i.e.,~where the aggregated data of one count do not affect any other count.
Grouping partitions the data set into similar clusters.
The size and the similarity measure of the clusters are data dependent.
Random grouping consumes less privacy budget, as there is minimal interaction with the original data.
However, when using a grouping technique based on sampling, which has some privacy cost but produces better groups, the impact of the perturbation decreases.
During the smoothing phase, the average value of each cluster is calculated, and finally, Laplace noise is added to these values.
In this way, the query sensitivity becomes less dependent on each individual's data, and therefore less perturbation is required.

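A minimal Python sketch of the grouping-and-smoothing idea follows, under simplifying assumptions: random grouping into fixed-size groups, unit sensitivity per count, and a hypothetical group size and budget; it illustrates why averaging within groups reduces the required noise.

\begin{verbatim}
import numpy as np

def group_and_smooth(counts, group_size, epsilon, rng=None):
    """Randomly group the count series, replace each group by its average
    (smoothing), and perturb the averages with Laplace noise; averaging
    lowers each count's influence on the output, so less noise suffices."""
    rng = rng or np.random.default_rng()
    counts = np.asarray(counts, dtype=float)
    order = rng.permutation(len(counts))          # random grouping
    released = np.empty_like(counts)
    for start in range(0, len(counts), group_size):
        group = order[start:start + group_size]
        avg = counts[group].mean()                # smoothing
        # Assumed unit sensitivity per count: the group average changes by
        # at most 1/len(group), hence the reduced Laplace scale below.
        released[group] = avg + rng.laplace(0, 1.0 / (len(group) * epsilon))
    return released

# Hypothetical series of daily check-in counts.
print(group_and_smooth([10, 12, 11, 40, 42, 39], group_size=3, epsilon=1.0))
\end{verbatim}
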
% Differentially private sequential data publication via variable-length n-grams
% - statistical (trajectories)
% - finite
% - batch
% - linkage
% - user
% - differential privacy
% - perturbation (adaptive Laplace)
\hypertarget{chen2012differentially}{Chen et al.}~\cite{chen2012differentially} exploit a text-processing technique, the \emph{$n$-gram} model, i.e.,~a contiguous sequence of $n$ items from a given data sample, to release sequential data without releasing the noisy statistics (counts) of all of the possible sequences.
This model allows the publishing of the most common $n$-grams ($n$ is, typically, less than $5$) to accurately reconstruct the original data set.
The privacy technique that the authors propose is suitable for count queries and frequent sequential pattern mining scenarios.
In particular, one of the applications that the authors consider concerns sequential spatiotemporal data (i.e.,~trajectories) of individuals.
They group grams based on the similarity of their $n$ values, construct a search tree, and inject Laplace noise into each node value (count) to achieve user-level differential privacy protection.
Instead of allocating the available privacy budget based on the overall maximum height of the tree, they estimate each path adaptively, based on known noisy counts.
The grouping process continues until the desired threshold of $n$ is reached.
Thereafter, they release variable-length $n$-grams with certain thresholds on the count values and tree heights, dealing with the trade-off that shorter grams carry less information than longer ones, but also less relative error.
They use a set of consistency constraints, i.e.,~the sum of the noisy counts of a node's children has to be less than or equal to the node's own noisy count, and all the noisy counts of leaf nodes have to be within a predefined threshold.
These constraints improve the final data utility, since they result in lower values of $n$.
On the one hand, this translates into higher counts, large enough to withstand the noise injection and the inherent Markov assumption of the $n$-gram model.
On the other hand, privacy is enhanced when the universe of all grams with a lower $n$ value is relatively small, resulting in more common sequences; this, nonetheless, is rarely the case in real-life scenarios.

% Differentially private publication of general time-serial trajectory data
% - statistical (trajectories)
% - finite
% - batch
% - linkage
% - user
% - differential privacy
% - perturbation (exponential, Laplace)
\hypertarget{hua2015differentially}{Hua et al.}~\cite{hua2015differentially} use, similarly to the scheme proposed in~\cite{chen2012differentially}, the $n$-gram modeling technique for publishing trajectories that contain a small number of $n$-grams, and thus share few or even no identical prefixes.
They propose a differentially private location-specific generalization algorithm (exponential mechanism), where each position in the trajectory is one record.
The algorithm probabilistically partitions the locations at each timestamp, with probability proportional to their Euclidean distance from each other.
They replace each partition with its centroid, and therefore offer better utility by creating groups of locations belonging to close trajectories.
They optimize the algorithm for time efficiency by using classic $k$-means clustering.
Then, the algorithm releases the new trajectories by observing the generalized location partitions and their perturbed counts (i.e.,~the sum of the same locations at each timestamp), with noise drawn from a Laplace distribution.
The process continues until the total count of the published trajectories reaches the size of the original data set.
They can limit the total number of the possible trajectories by taking into account the individual's moving speed.
The authors measured the utility for distorted spatiotemporal range queries using the Hausdorff distance from the original results, and concluded that the utility deterioration remains within reasonable boundaries considering the offered privacy guarantees.
Similarly to~\cite{chen2012differentially}, their approach works well for a small location domain.
To make it applicable to realistic scenarios, it is essential to truncate the original trajectories in an effort to reduce the location domain.
This results in a coarse discretization of the location area, leading to the arbitrary distortion of the spatial correlations that are present in the original data set.

% Achieving differential privacy of trajectory data publishing in participatory sensing
% - statistical (trajectories)
% - finite
% - batch
% - linkage
% - user
% - differential privacy
% - perturbation (Laplace)
\hypertarget{li2017achieving}{Li et al.}~\cite{li2017achieving} focus on publishing a set of trajectories, where, contrary to~\cite{hua2015differentially}, each one is considered as a single entry in the data set.
First, using $k$-means clustering, they partition the original locations based on their pairwise Euclidean distances.
The scheme represents each location partition by its mean (centroid).
A larger number of partitions, in areas where close centroids exist, results in fewer locations in each partition, and thus in lower trajectory precision loss.
Before adding noise, they randomly select partition centroids to generate trajectories, until they reach the size of the original data set.
Then, they generate Laplace noise, which they bound according to a set of constraints, and add it to the count of locations at each point of every trajectory.
Finally, they release the generalized trajectories along with the noisy count of each location point.
The authors prove experimentally that they considerably reduce the trajectory merging time, at the expense of utility.

% DPT: differentially private trajectory synthesis using hierarchical reference systems
% - statistical (trajectories)
% - finite
% - batch
% - dependence
% - user
% - differential privacy
% - perturbation (Laplace)
% - spatial correlations (Hierarchical Reference Systems (HRS))
\hypertarget{he2015dpt}{He et al.} present \emph{DPT} (Differentially Private Trajectory)~\cite{he2015dpt}, a system that synthesizes mobility data from raw, speed-varying trajectories of individuals, while providing $\varepsilon$-differential privacy protection guarantees.
The system constructs a \emph{Hierarchical Reference Systems} (HRS) model to capture correlations between adjacent locations, by imposing a uniform grid at multiple resolutions (i.e.,~for different speed values) over the space, keeping a prefix tree for each resolution, and choosing the centroids as anchor points.
In each reference system, anchor points have a small number of neighboring points, with an increasing (by a constant factor) average distance between them, and fewer children anchor points as the grid resolution becomes finer.
DPT estimates transition probabilities only for the anchor points in proximity to the last observed location, and chooses the appropriate reference system for each raw point, so that the consecutive points of the trajectory are either neighboring anchors or have a parent-child relationship.
The system generates the transition probabilities by estimating the counts in the prefix trees.
Thereafter, it chooses the appropriate prefix trees, perturbs them with noise drawn from the Laplace distribution, and adaptively prunes subtrees with low counts to improve the resulting utility.
DPT implements a direction-weighted sampling postprocessing strategy for the synthetic trajectories, to avoid the loss of the directionality of the original trajectories due to the perturbation.
Nonetheless, as with all other similar techniques, the usage of prefix trees limits the length of the released trajectories, which results in an uneven spatial distribution.

% Pufferfish Privacy Mechanisms for Correlated Data
% - statistical
% - finite
% - batch
% - dependence
% - unspecified
% - \emph{Pufferfish}
% - perturbation (Laplace)
% - general (Bayesian networks/Markov chains)
\hypertarget{song2017pufferfish}{Song et al.}~\cite{song2017pufferfish} propose the \emph{Wasserstein mechanism}, a technique that applies to any general instantiation of Pufferfish (see Section~\ref{subsec:privacy-statistical}).
It adds noise proportional to the sensitivity of a query $F$, which depends on the worst case distance between the distributions $P(F(X)|s_i,d)$ and $P(F(X)|s_j,d)$ for a variable $X$, a pair of secrets $(s_i,s_j)$, and an evolution scenario $d$.
The Wasserstein metric is the function that calculates this worst case distance between the two distributions.
The noise is drawn from a Laplace distribution with scale parameter equal to the maximum Wasserstein distance over the distributions of all the pairs of secrets, divided by the available privacy budget $\varepsilon$.
For optimization purposes, the authors consider a more restricted setting.
This setting utilizes an evolution scenario for representing the data correlations, and Bayesian networks for modeling them.
The authors state that, in cases where the Bayesian networks are complex, Markov chains are a more efficient alternative.
The \emph{Markov quilt} mechanism, a generalization of the \emph{Markov blanket} mechanism, calculates the data dependencies.
The dependent nodes of any node consist of its parents, its children, and the other parents of its children.
The present technique excels at data sets generated by monitoring applications or networks, but it is not suitable for online scenarios.

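In symbols (notation ours, following the description above), the mechanism releases $F(X) + Z$ with
\[
Z \sim \mathrm{Lap}\left(\frac{W}{\varepsilon}\right),
\qquad
W = \max_{(s_i, s_j)} \sup_{d} W_{\infty}\big(P(F(X) \mid s_i, d),\, P(F(X) \mid s_j, d)\big),
\]
where $W_{\infty}$ denotes the $\infty$-Wasserstein distance, and the maximum ranges over all pairs of secrets and evolution scenarios.
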
% Differentially private multi-dimensional time series release for traffic monitoring
% - statistical (location)
% - finite
% - streaming
% - dependence
% - user
% - differential privacy
% - perturbation (Laplace)
% - spatiotemporal/serial correlations
\hypertarget{fan2013differentially}{Fan et al.}~\cite{fan2013differentially} propose a real-time framework for releasing differentially private multi-dimensional traffic monitoring data.
At every timestamp, the Perturbation module injects noise drawn from a Laplace distribution into the data.
Then, the Estimation module post-processes the perturbed data to improve the accuracy.
The authors propose a temporal and a spatial estimation algorithm.
The former estimates an internal time series model for each location to improve the utility of the perturbation's outcome by performing a posterior estimation that utilizes Gaussian approximation and Kalman filtering~\cite{kalman1960new}.
The latter reduces data sparsity by grouping neighboring locations using a spatial indexing structure based on a quadtree.
The Modeling/Aggregation module utilizes domain knowledge, e.g.,~road network and density, and has a bidirectional interaction with the other two in parallel.
Although the authors propose the framework for real-time scenarios, they do not deal with infinite data processing/publication, which considerably limits its applicability.

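As an indication of how such a posterior estimation may operate, the following minimal sketch (our own one-dimensional simplification, assuming a constant process model with process-noise variance \texttt{q}; the Laplace perturbation is approximated as Gaussian measurement noise with variance $2/\varepsilon^2$) applies a scalar Kalman filter to a series of perturbed counts:

\begin{verbatim}
import numpy as np

def kalman_posterior(noisy, eps, q=1.0):
    # Measurement-noise variance: Lap(1/eps) has variance 2/eps^2.
    r = 2.0 / eps ** 2
    x, p = noisy[0], r           # initial estimate and covariance
    estimates = [x]
    for z in noisy[1:]:
        p += q                   # predict: uncertainty grows by q
        k = p / (p + r)          # Kalman gain
        x += k * (z - x)         # correct with the new noisy measurement
        p *= 1 - k               # posterior covariance
        estimates.append(x)
    return estimates
\end{verbatim}
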
% An Adaptive Approach to Real-Time Aggregate Monitoring With Differential Privacy
% - statistical
% - finite
% - streaming
% - linkage
% - user
% - differential privacy
% - perturbation (dynamic Laplace)
In another work, \hypertarget{fan2014adaptive}{Fan et al.} designed \emph{FAST}~\cite{fan2014adaptive}, an adaptive system that allows the release of real-time aggregate time series under user-level differential privacy.
It achieves this by using a Sampling, a Perturbation, and a Filtering module.
The Sampling module samples, at an adaptive rate, the aggregates to be perturbed.
The Perturbation module adds noise to each sampled point according to the allocated privacy budget.
The Filtering module receives the perturbed data point along with the original one, and generates a posterior estimate, which is finally released.
The error between the perturbed and the released (posterior estimate) point is used to adapt the sampling rate; the sampling frequency increases when the data go through rapid changes, and decreases otherwise.
Thus, depending on the adjusted sampling rate, not every single data point is perturbed, which saves part of the available privacy budget.
While the system considers the temporal correlations of the processed time series, it does not attempt to deal with the privacy threat that they might pose.

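The feedback loop of the Sampling module can be pictured with the following toy sketch, which captures only the direction of the adaptation (FAST actually adjusts the rate with a PID controller over the filtering error):

\begin{verbatim}
def adapt_interval(interval, error, threshold, lo=1, hi=64):
    # Shrink the sampling interval when the estimation error is large
    # (rapidly changing data); widen it when the data are stable.
    if error > threshold:
        return max(lo, interval // 2)
    return min(hi, interval * 2)
\end{verbatim}
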
% CTS-DP: publishing correlated time-series data via differential privacy
% - statistical (they use trajectories in the experiments)
% - finite
% - streaming
% - dependence
% - event
% - differential privacy
% - perturbation (correlated Laplace)
% - serial correlations (autocorrelation function)
\hypertarget{wang2017cts}{Wang and Xu}~\cite{wang2017cts} defined Correlated Time Series Differential Privacy (\emph{CTS-DP}).
The scheme perturbs the data with a Correlated Laplace Mechanism (CLM), and guarantees that the introduced noise series and the original time series are indistinguishable with respect to their correlation (Series-Indistinguishability).
CTS-DP deals with the shortcomings of independent and identically distributed (i.i.d.) noise under the presence of correlations.
I.i.d. noise offers inadequate protection, because refinement methods, e.g.,~filtering, can remove it.
Most privacy-preserving methods choose to introduce more noise in the presence of strong correlations, thus diminishing the data utility.
An original and a perturbed time series satisfy Series-Indistinguishability if their normalized autocorrelation functions are the same; hence, the two time series are indistinguishable and the published time series satisfies differential privacy as well.
The authors exploit the fact that, in signal processing, an i.i.d. signal becomes non-i.i.d. after passing through a filter consisting of a combination of adders and delayers.
Hence, they design CLM, which passes four Gaussian white noise series through a linear system to produce a correlated Laplace noise series according to the autocorrelation function of the original time series.
Although the authors show experimentally that the implementation of CLM outperforms the current state-of-the-art methods, they do not test its robustness against arbitrary filters, which they keep as future work.

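A rough sketch of the CLM construction follows (our own illustration, assuming \texttt{scipy} is available; the first-order autoregressive coefficient \texttt{ar\_coeff} stands in for the filter that CLM derives from the autocorrelation function of the original series). It relies on the identity that the sum of two products of independent standard Gaussians follows a standard Laplace distribution:

\begin{verbatim}
import numpy as np
from scipy.signal import lfilter

def correlated_laplace(n, b, ar_coeff, seed=None):
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((4, n))
    # Impose the target autocorrelation on each Gaussian series.
    g = lfilter([np.sqrt(1 - ar_coeff ** 2)], [1, -ar_coeff], g, axis=1)
    # N1*N2 + N3*N4 ~ Laplace(0, 1); scale to parameter b.
    return b * (g[0] * g[1] + g[2] * g[3])
\end{verbatim}
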
\subsection{Infinite observation}
\label{subsec:statistical-infinite}

% Private and continual release of statistics
% - statistical
% - infinite
% - streaming
% - linkage
% - event
% - differential privacy
% - perturbation (Laplace)
\hypertarget{chan2011private}{Chan et al.}~\cite{chan2011private} designed continuous counting mechanisms for finite and infinite data processing and publishing, satisfying $\varepsilon$-differential privacy.
Their main contribution lies in proposing the Binary and the Hybrid mechanisms; the latter, in particular, does not require an a priori upper bound on the stream length.
The mechanisms rely on the release of intermediate partial sums of counts at consecutive timestamp intervals, called \emph{p-sums}, and the injection of noise drawn from a Laplace distribution.
The Binary mechanism constructs a binary tree where each node corresponds to a p-sum, and perturbs each released p-sum with Laplace noise.
The Hybrid mechanism publishes counts at sparse time intervals, i.e.,~timestamps that are a power of $2$, and runs an instance of the Binary mechanism in-between.
Both mechanisms offer event-level protection (pan-privacy) under single unannounced and continual announced intrusions, by adding a certain amount of noise to every p-sum in memory.
They can facilitate continual top-$k$ queries in recommendation systems, and multidimensional range queries.
Furthermore, they are able to support applications that require a consistent output, i.e.,~a counter that increases by either $0$ or $1$ at each timestamp.

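For concreteness, a minimal sketch of the Binary mechanism over a bit stream of known length $T$ follows (our own rendering of the description above; the pan-private variant is omitted):

\begin{verbatim}
import numpy as np

def binary_mechanism(stream, eps, T):
    levels = int(np.ceil(np.log2(T))) + 1
    alpha = np.zeros(levels)   # exact p-sums, one per tree level
    noisy = np.zeros(levels)   # their noisy counterparts
    released = []
    for t, x in enumerate(stream, start=1):
        i = (t & -t).bit_length() - 1     # lowest set bit of t
        alpha[i] = alpha[:i].sum() + x    # merge lower-level p-sums
        alpha[:i] = 0
        noisy[:i] = 0
        noisy[i] = alpha[i] + np.random.laplace(scale=levels / eps)
        # The count at t sums the noisy p-sums of the set bits of t.
        released.append(sum(noisy[j] for j in range(levels)
                            if t & (1 << j)))
    return released
\end{verbatim}

Since each item of the stream affects only a logarithmic number of p-sums, a per-p-sum noise scale logarithmic in $T$ suffices for $\varepsilon$-differential privacy.
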
% Differentially private real-time data release over infinite trajectory streams
% - statistical (spatial)
% - infinite
% - streaming
% - linkage
% - personalized w-event
% - differential privacy
% - perturbation (dynamic Laplace)
\hypertarget{cao2015differentially}{Cao et al.}~\cite{cao2015differentially} developed a framework that achieves personalized \emph{l-trajectory} privacy protection by dynamically adding noise at each timestamp; the noise fades exponentially over time.
Each individual can specify, in an array of size $l$, the desired protection level for each location of his/her trajectory.
The proposed framework is composed of three components.
The Dynamic Budget Allocation component allocates portions of the privacy budget to the other two components: a fixed one to the Private Approximation, and a dynamic one to the Private Publishing component at each timestamp.
The Private Approximation component estimates, under a utility goal and an approximation strategy, whether it is beneficial to publish approximate data or not.
More precisely, it chooses an appropriate previous noisy data release and republishes it if it is similar to the real statistics planned to be published.
The Private Publishing component takes as input the real statistics, along with the timestamp, determined by the Private Approximation component, of the approximate data to be republished.
If that timestamp is equal to the current one, then the current data, perturbed with Laplace noise, are published.
Otherwise, the release of the corresponding previous timestamp is republished.
The approximation technique is highly suitable for stream processing, since it can significantly reduce the privacy budget consumption.
However, the framework does not take into account privacy leakage stemming from data dependencies, which considerably limits its applicability to real-life data sets.

% Private decayed predicate sums on streams
% - statistical
% - infinite
% - streaming
% - linkage
% - w-event
% - differential privacy
% - perturbation (Laplace)
\hypertarget{bolot2013private}{Bolot et al.}~\cite{bolot2013private} introduce the notion of \emph{decayed privacy} in continual observation of aggregates (sums).
The authors recognize the fact that monitoring applications focus more on recent events and data; therefore, the value of previous data releases fades as time passes.
This leads to a scheme of privacy with expiration, according to which recent events and data are more privacy sensitive than preceding ones.
Based on this, they apply decayed sum functions for answering sliding window queries of fixed window size $w$ on data streams.
Namely, the window sum computes the difference of two running sums, while the exponentially and polynomially decayed sums estimate the sum of decayed data.
For every $w$ consecutive data points, the algorithm generates a binary tree where each node is perturbed with Laplace noise with scale proportional to $w$.
Instead of maintaining a binary tree for every window, the algorithm considers the windows that span two blocks as the union of a suffix and a prefix of two consecutive trees.
This way, the global sensitivity of the query function is kept low.
The proposed techniques are designed for fixed window sizes; hence, when answering multiple sliding window queries with variable window sizes, they have to distribute the available privacy budget accordingly.

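Indicatively, for a stream $x_1, x_2, \dots$ and the current timestamp $t$, the three flavors of decayed sums can be written as (standard formulations; the exact parameterization in~\cite{bolot2013private} may differ slightly):
\[
\mathit{WS}_w(t) = \sum_{i=t-w+1}^{t} x_i,
\qquad
\mathit{ES}_\alpha(t) = \sum_{i=1}^{t} x_i\, \alpha^{t-i},
\qquad
\mathit{PS}_c(t) = \sum_{i=1}^{t} \frac{x_i}{(t-i+1)^{c}},
\]
with decay parameters $0 < \alpha < 1$ and $c > 0$.
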
% Differentially private event sequences over infinite streams
% - statistical
% - infinite
% - streaming
% - linkage
% - w-event
% - differential privacy
% - perturbation (Laplace)
Based on the notion of decayed privacy~\cite{bolot2013private}, \hypertarget{kellaris2014differentially}{Kellaris et al.}~\cite{kellaris2014differentially} defined $w$-event privacy in the setting of periodical release of statistics (counts) in infinite streams.
To achieve $w$-event privacy, the authors propose two mechanisms (Budget Distribution and Budget Absorption) based on sliding windows, which effectively distribute the privacy budget to sub-mechanisms (one sub-mechanism per timestamp) applied on the data of a window of the stream.
Both algorithms may decide to publish a new noisy count for a specific timestamp, based on the similarity level of the current count with a previously published one.
Moreover, both algorithms have the constraint that the total privacy budget consumed in a window is less than or equal to $\varepsilon$.
The Budget Distribution algorithm distributes the privacy budget in an exponentially fading manner, following the assumption that within a window most of the counts remain similar.
The budget of expired timestamps becomes available for the next publications (of next windows).
The Budget Absorption algorithm uniformly distributes the budget to the window's timestamps from the beginning.
A publication uses not only the by-default allocated budget but also the budget of non-published timestamps.
In order not to exceed the limit of $\varepsilon$, an adequate number of subsequent timestamps is `silenced' after a publication takes place.
Even though one could argue that $w$-event privacy is achievable via user-level privacy, the latter is impractical because the rigidity of its budget allocation would eventually render the output useless.

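The following minimal sketch conveys the Budget Absorption idea (our own simplification: the similarity test is performed directly on the noisy value against the last release, whereas the actual mechanism reserves a separate part of the budget for it):

\begin{verbatim}
import numpy as np

def budget_absorption(stream, eps, w, threshold):
    per_ts = eps / w                     # by-default uniform allocation
    last, absorbed, silenced_until = None, 0.0, -1
    releases = []
    for i, count in enumerate(stream):
        if i <= silenced_until:
            releases.append(last)        # forced approximation
            continue
        budget = per_ts + absorbed       # absorb skipped timestamps
        noisy = count + np.random.laplace(scale=1.0 / budget)
        if last is None or abs(noisy - last) > threshold:
            last = noisy                 # publish a new noisy count
            # Silence enough timestamps to pay back the absorbed budget.
            silenced_until = i + int(absorbed / per_ts)
            absorbed = 0.0
        else:
            absorbed += per_ts           # skip: budget becomes absorbable
        releases.append(last)
    return releases
\end{verbatim}
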
% RescueDP: Real-time spatio-temporal crowd-sourced data publishing with differential privacy
% - statistical (spatial)
% - infinite
% - streaming
% - linkage
% - w-event
% - differential privacy
% - perturbation (dynamic Laplace)
% - serial correlations (Pearson's r)
\hypertarget{wang2016rescuedp}{Wang et al.}~\cite{wang2016rescuedp} propose \emph{RescueDP} for the publishing of real-time user-generated spatiotemporal data, utilizing differential privacy with $w$-event-level protection.
RescueDP uses a Dynamic Grouping module to create clusters of regions with small statistics, i.e.,~areas with a small number of samples.
It estimates the similarity of the data trends of these regions by utilizing Pearson's correlation coefficient, and creates groups accordingly.
The data of each group then pass through a Perturbation module that injects Laplace noise into them.
The grouping of the previous phase increases the sample size of each group of regions, which minimizes the error due to the noise injection.
The implementation of a Kalman Filtering~\cite{kalman1960new} module further increases the utility of the released data.
A Budget Allocation module distributes the available privacy budget to sampling points within any successive $w$ timestamps.
RescueDP saves part of the available privacy budget by approximating the non-sampled data with previously released perturbed data.
During the whole process, an Adaptive Sampling module adjusts the sampling interval according to the difference in the released data statistics over the previous timestamps, while taking into account the remaining privacy budget.

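A toy version of the trend-based grouping could look as follows (a greedy sketch of our own; the actual Dynamic Grouping module additionally considers the magnitude of the statistics and operates under privacy constraints):

\begin{verbatim}
import numpy as np

def group_regions(histories, r_threshold=0.8):
    # histories: one array of recent counts per region; regions whose
    # histories exhibit a Pearson correlation above the threshold are
    # greedily placed in the same group.
    groups, assigned = [], set()
    for i in range(len(histories)):
        if i in assigned:
            continue
        group = [i]
        assigned.add(i)
        for j in range(i + 1, len(histories)):
            if j not in assigned and \
               np.corrcoef(histories[i], histories[j])[0, 1] > r_threshold:
                group.append(j)
                assigned.add(j)
        groups.append(group)
    return groups
\end{verbatim}
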
% RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response
% - statistical
% - infinite
% - streaming
% - linkage
% - user
% - differential privacy
% - randomization (randomized response)
% - local
\hypertarget{erlingsson2014rappor}{Erlingsson et al.}~\cite{erlingsson2014rappor} presented \emph{RAPPOR} (Randomized Aggregatable Privacy-Preserving Ordinal Response) as a solution for the privacy-preserving collection of statistics.
RAPPOR performs all the necessary data processing on the side of the data generators by applying the method of randomized response, which guarantees local differential privacy.
The product of each local privacy-preserving processing is a report that can be represented as a bit string.
Each bit corresponds to a randomized response to a logical predicate on an individual's personal data, e.g.,~categorical properties, numerical and ordinal values, or categories that cannot be enumerated.
Initially, RAPPOR hashes a sensitive value into a Bloom filter~\cite{bloom1970space}.
It creates a binary reporting value, which it keeps in memory (\emph{memoization}) and reuses for future reports (permanent randomized response).
Memoization offers long-term longitudinal privacy protection for privacy-sensitive data values that do not change over time or that are not correlated.
RAPPOR deals with tracking externalities by reporting a randomized version of the permanent randomized response (instantaneous randomized response).
Although this adds an extra layer of randomization to the reported values, averaging a large number of such reports might still allow an adversary to estimate the underlying permanent randomized response.
Finally, the authors propose a decoding technique that involves grouping, least-squares solving, and regression.
This way, they effectively make up for the loss of information due to the randomization of the previous steps, and allow the extraction of useful information from the observed bit strings.
They test their implementation with both simulated and real data, and show that they can extract statistics with high utility while preserving the privacy of the individuals involved.
However, the fact that the privacy guarantees of the technique are valid only for stationary individuals producing independent data, on top of its relatively complex configuration, renders the proposal impractical for many real-world scenarios.

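The two randomization layers can be sketched as follows, using RAPPOR's own parameters $f$, $p$, and $q$ (the preceding Bloom filter encoding step is omitted):

\begin{verbatim}
import numpy as np

def permanent_rr(true_bits, f, rng):
    # With probability f, replace the true bit by a uniformly random
    # one (i.e., 1 with probability f/2 and 0 with probability f/2);
    # the result is memoized and reused in all future reports.
    fake = rng.integers(0, 2, size=len(true_bits))
    return np.where(rng.random(len(true_bits)) < f, fake, true_bits)

def instantaneous_rr(perm_bits, p, q, rng):
    # Report 1 with probability q where the memoized bit is 1,
    # and with probability p where it is 0.
    probs = np.where(perm_bits == 1, q, p)
    return (rng.random(len(perm_bits)) < probs).astype(int)
\end{verbatim}
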
% PrivApprox: privacy-preserving stream analytics
% - statistical
% - infinite
% - streaming
% - linkage
% - event
% - zero-knowledge
% - perturbation (randomized response)
\hypertarget{quoc2017privapprox}{Le Quoc et al.}~\cite{quoc2017privapprox} propose \emph{PrivApprox}, a data analytics system for privacy-preserving stream processing of distributed data sets that combines sampling and randomized response.
The system distributes the analysts' queries to clients via an aggregator and proxies, and employs sliding window computations over batched stream processing to handle the data stream generated by the clients.
After sampling the locally available data, the clients transmit a randomized response to the aggregator via proxies that apply (XOR-based) encryption.
The combination of sampling and randomized response achieves \emph{zero-knowledge} based privacy, i.e.,~the clients prove that they know a piece of information without actually disclosing its value.
The aggregator collects the received responses and returns statistics to the analysts.
The query model expresses the responses to numerical queries as counts within histogram buckets, whereas for non-numeric queries it specifies each bucket by a matching rule or a regular expression.
A confidence metric quantifies the approximation of the results caused by the sampling and the randomization.
PrivApprox achieves low-latency stream processing and enables a synchronization-free distributed architecture that requires low trust in a central entity.
Since it implements a sliding window methodology for processing infinite series of data sets, it would be worthwhile to investigate how to achieve $w$-event-level privacy protection.

% Hiding in the crowd: Privacy preservation on evolving streams through correlation tracking
% - statistical
% - infinite
% - streaming
% - data dependence
% - event
% - randomization
% - perturbation (dynamic)
% - serial correlations (data trends)
\hypertarget{li2007hiding}{Li et al.}~\cite{li2007hiding} attempt to tackle the problem of privacy preservation in numerical data streams, taking into account the correlations that may continuously appear among multiple streams and within each one of them.
Firstly, the authors define the utility and privacy specifications.
The utility of a perturbed data stream is the inverse of the discrepancy between the original and the perturbed measurements.
The discrepancy is set as the normalized Frobenius norm, i.e.,~a matrix norm defined as the square root of the sum of the absolute squares of its elements.
Privacy corresponds to the discrepancy between the original and the reconstructed data stream (from the perturbed one), and consists of the removed noise and the error introduced by the reconstruction.
Then, correlations come into play.
The system continuously monitors the data streams for trends to track correlations, and dynamically perturbs the original numerical data while maintaining the trends that are present.
More specifically, the Streaming Correlated Additive Noise (SCAN) module updates the estimation of the local principal components of the original data, and proportionally distributes noise along the components.
Thereafter, the Streaming Correlation Online Reconstruction (SCOR) module removes as much of the noise as possible by utilizing the best linear reconstruction.
SCOR represents the ability of any adversarial entity to post-process the released data and attempt to reconstruct the original data set by filtering out the distortion.
Overall, the technique offers robustness against inference attacks by adapting the randomization according to the data trends, but fails to efficiently quantify the overall privacy guarantee.

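The noise-distribution step of SCAN can be sketched as follows (an offline simplification of our own; the actual module tracks the principal components incrementally as the streams evolve):

\begin{verbatim}
import numpy as np

def scan_perturb(window, noise_budget, rng):
    # window: n timestamps x d streams. Distribute Gaussian noise along
    # the principal components, proportionally to the variance each
    # component captures, so that the noise follows the data trends.
    centered = window - window.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    weights = eigvals / eigvals.sum()
    coeffs = rng.standard_normal(window.shape)
    noise = (coeffs * np.sqrt(weights * noise_budget)) @ eigvecs.T
    return window + noise
\end{verbatim}
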
% PeGaSus: Data-Adaptive Differentially Private Stream Processing
% - statistical
% - infinite
% - streaming
% - linkage
% - event
% - differential privacy
% - perturbation (Laplace)
\hypertarget{chen2017pegasus}{Chen et al.}~\cite{chen2017pegasus} developed \emph{PeGaSus}, an algorithm for event-level differentially private stream processing that supports different categories of stream queries (counts, sliding window, and event monitoring) over multiple stream resolutions.
It consists of three modules: a Perturber, a Grouper, and a Smoother.
The Perturber consumes the incoming data stream, adds to each data item noise calibrated with a part $\varepsilon_p$ of the available privacy budget $\varepsilon$, and outputs a stream of noisy data.
The data-adaptive Grouper consumes the original stream and partitions the data into well-approximated regions, also using a part $\varepsilon_g$ of the available privacy budget.
Finally, a query-specific Smoother combines the independent information produced by the Perturber and the Grouper, and performs post-processing by calculating the final estimates of the Perturber's values for each partition created by the Grouper at each timestamp.
The combination of the Perturber and the Grouper follows the sequential composition and post-processing properties of differential privacy; thus, the resulting algorithm satisfies ($\varepsilon_p + \varepsilon_g$)-differential privacy.

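A heavily simplified, offline rendering of the pipeline follows (our own sketch: the partition is assumed to be already computed, whereas the actual Grouper builds it privately and online with budget $\varepsilon_g$):

\begin{verbatim}
import numpy as np

def pegasus_offline(counts, eps_p, partition, rng):
    counts = np.asarray(counts, dtype=float)
    # Perturber: Laplace noise calibrated with budget eps_p.
    noisy = counts + rng.laplace(scale=1.0 / eps_p, size=len(counts))
    # Smoother: estimate each count by the mean of the noisy values in
    # its group, reducing the variance within well-approximated regions.
    smoothed = np.empty_like(noisy)
    for group in partition:              # partition: lists of indices
        idx = np.asarray(group)
        smoothed[idx] = noisy[idx].mean()
    return smoothed
\end{verbatim}
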
4
text/related/summary.tex
Normal file
@ -0,0 +1,4 @@
\section{Summary}
\label{sec:sum-rel}

This is the summary of this chapter.