\section{Microdata}
\label{sec:microdata}
As observed in Table~\ref{tab:related}, privacy-preserving algorithms for microdata rely on $k$-anonymity, or derivatives of it. Ganta et al.~\cite{ganta2008composition} revealed that $k$-anonymity methods are vulnerable to \emph{composition attacks}. Consequently, these attacks drew the attention of researchers, who proposed various algorithms based on $k$-anonymity, each adding a different dimension to the problem, for instance that previous releases are known to the publisher, or that the quasi-identifiers can be formed by combining attributes in different releases. Note, however, that only one of the following works (Li et al.~\cite{li2016hybrid}) assumes \emph{independently} anonymized data sets that may not be known to the publisher in the attack model, making it more general than the rest.
% \subsection{Continual data}
% \mk{Nothing to put here.}
\subsection{Data streams}
% M-invariance: towards privacy preserving re-publication of dynamic data sets
\hypertarget{xiao2007m}{Xiao et al.}~\cite{xiao2007m} consider the case where a data set is (re)published as a series of snapshots, each obtained from the previous one through updates (tuple deletions and insertions). More precisely, they address anonymization in dynamic environments by implementing \emph{$m$-invariance}. In a simple $k$-anonymization (or $l$-diversity) scenario, the privacy of an individual who exists in two updates can be compromised by intersecting the sets of sensitive values associated with that individual in each release. In contrast, an individual who exists in a series of $m$-invariant releases is always associated with the same set of $m$ different sensitive values. To enable the publishing of $m$-invariant data sets, artificial tuples called \emph{counterfeits} may be added to a release. To minimize the noise added to the data sets, the authors provide an algorithm with two extra desiderata: minimizing the number of counterfeits and the generalization level of the quasi-identifiers. Still, adding tuples with specific sensitive values disturbs the value distribution, with a direct effect on any relevant statistical analysis.
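As a minimal illustration of the intersection attack described above (the values are hypothetical and not taken from~\cite{xiao2007m}), suppose that an individual falls into a group with sensitive values $S_1$ in the first release and into a group with sensitive values $S_2$ in the second:
\[
S_1 = \{\mathrm{flu}, \mathrm{bronchitis}\}, \qquad S_2 = \{\mathrm{flu}, \mathrm{gastritis}\}, \qquad S_1 \cap S_2 = \{\mathrm{flu}\}.
\]
Although each release is $2$-diverse on its own, the intersection leaves a single candidate value. Under $m$-invariance with $m = 2$, the individual's group would carry the same set of values in both releases, i.e., $S_1 = S_2$ and $|S_1 \cap S_2| = 2$, so the intersection reveals nothing beyond a single release.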
% Preventing equivalence attacks in updated, anonymized data
In the same update setting (insert/delete), \hypertarget{he2011preventing}{He et al.}~\cite{he2011preventing} introduce another kind of attack, namely the \emph{equivalence} attack, which is not taken into account by the aforementioned $m$-invariance technique. The equivalence attack allows for sets of individuals (of size $e<m$) to be associated with sets of sensitive values with a probability lower than $m$, in different snapshots. For example, through tuple deletions, we may infer that two individuals share the exact same sensitive value (and thus may be considered equivalent). For a series of snapshot releases to be private, they have to be both $m$-invariant and $e$-equivalent ($e\leq m$). Subsequently, the authors propose an algorithm that incorporates $m$-invariance and is based on the graph optimization \emph{min cut} problem, for publishing $e$-equivalent data sets. The proposed method can achieve better levels of privacy, with running time and quality comparable to those of $m$-invariance.
% A hybrid approach to prevent composition attacks for independent data releases
\hypertarget{li2016hybrid}{Li et al.}~\cite{li2016hybrid} identified a common assumption in most privacy techniques: when anonymizing a data set, all previous releases are known to the data owner. It is probable, however, that the releases are independent of each other, and that the data owner is unaware of these releases when anonymizing the data set. In such a setting, the previous techniques would suffer from composition attacks. The authors define this kind of adversary and propose a hybrid model for data anonymization. More precisely, the adversary knows that an individual exists in two different data sets and has access to the anonymized versions, but the anonymization is done independently (i.e.,~without knowledge of the other data set) for each data set. The key idea in fighting a composition attack is to increase the probability that matches among tuples from the two data sets are random, linking different individuals rather than the same one. To do so, the proposed anonymization exploits three preprocessing steps before applying a traditional $k$-anonymity or $l$-diversity anonymization algorithm. First, the data set is sampled so as to blur the knowledge of the existence of individuals. Then, especially in small data sets, quasi-identifiers are perturbed by noise addition, before the classical generalization step. Finally, in the case of sparse data, the sensitive values are generalized in addition to the quasi-identifiers. Applying this method on top of $k$-anonymity reduces the danger of composition attacks compared to plain $k$-anonymity, while achieving comparable quality. Moreover, the quality results are shown to be substantially better than those obtained with $\varepsilon$-differential privacy. This is a good attempt at independently anonymizing multiple releases of the same data; however, the scenario is restricted to releases over the same database schema, using the same perturbation and generalization functions.
% Continuous privacy preserving publishing of data streams
\hypertarget{zhou2009continuous}{Zhou et al.}~\cite{zhou2009continuous} introduce the problem of continuous private data publication in \emph{streams}, and propose a randomized solution based on $k$-anonymity. In their definition, a private stream consists of published equivalence classes of size larger than or equal to $k$, containing generalized tuples from distinct persons (or identifiers in general). To create the equivalence classes they set several desiderata. Apart from the size of a class, which should be larger than or equal to $k$, the information loss incurred by the generalization should be low, and the delay in forming and publishing the class should be low as well. To achieve these, they build a randomized model using the popular $R$-tree structure, extended to accommodate data density distribution information. In this way, they achieve a better quality for the released private data: on the one hand, formed classes contain data items that are close to each other (in dense areas), while on the other hand, classes with tuples from sparse areas are released as soon as possible so that the delay remains low. This work has a special focus on publishing good-quality private data. Still, it does not consider attacks where background knowledge exists, nor does it measure the privacy level achieved (other than requiring the size of the released class to be larger than or equal to $k$, as in $k$-anonymity), as $\varepsilon$-differential privacy does.
% Maskit: Privately releasing user context streams for personalized mobile applications
\hypertarget{gotz2012maskit}{Gotz et al.}~\cite{gotz2012maskit} developed \emph{MaskIt}, a system that interfaces with the sensors of a personal device, identifies various sets of \emph{contexts}, and releases a stream of privacy-preserving contexts to untrusted applications installed on the device. A context is defined as the circumstances that form the setting for an event, e.g.,~`at the office', `running', etc. The users have to define the sensitive contexts that they wish to protect and the desired level of privacy. The system models the users' various contexts and the transitions between them. Temporal correlations are captured using Markov chains by taking into account historical observations. After the initialization, \emph{MaskIt} filters a stream of user contexts by checking, for each context, whether it can be released or needs to be suppressed. More specifically, a system $A$ preserves \emph{$\delta$-privacy} against an adversary if for all possible inputs $\overrightarrow{x}$ sampled from the Markov chain $M$ with non-zero probability (i.e.,~$\Pr[\overrightarrow{x}] > 0$), for all possible outputs $\overrightarrow{o}$ ($\Pr[A(\overrightarrow{x}) = \overrightarrow{o}] > 0$), for all times $t$ and all sensitive contexts $s\in S$, it satisfies the condition $\Pr[X_t = s \mid \overrightarrow{o}] - \Pr[X_t = s] \leq \delta$. After filtering all the elements of a given stream, an output sequence for a single day is released. The process can be repeated to publish longer context streams. The utility of the system is measured as the expected number of released contexts. Letting users define the privacy settings requires that they have a certain level of relevant knowledge, which is not usually the case in real life. Additionally, suppressing data can sometimes disclose more information than releasing them, e.g.,~releasing multiple data points around a `sensitive' area (and not inside it) is going to eventually disclose the protected area.
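To make the $\delta$-privacy condition above concrete, consider a hypothetical numeric instance (the numbers are illustrative assumptions, not taken from~\cite{gotz2012maskit}). If the prior probability of a sensitive context is $\Pr[X_t = s] = 0.1$ and the user sets $\delta = 0.2$, then every possible output $\overrightarrow{o}$ must satisfy
\[
\Pr[X_t = s \mid \overrightarrow{o}] \leq \Pr[X_t = s] + \delta = 0.1 + 0.2 = 0.3 .
\]
An output that would raise the adversary's posterior belief to, say, $0.5$ violates the condition, so a $\delta$-private filter must never produce it.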
% PLP: Protecting location privacy against correlation analyze Attack in crowdsensing
\hypertarget{ma2017plp}{Ma et al.}~\cite{ma2017plp} propose \emph{PLP}, a crowdsensing scheme that protects location privacy against adversaries that can extract spatiotemporal correlations, modeled with conditional random fields (CRFs), from crowdsensing data. The users' context stream (location, sensing data) is filtered while long-range dependencies among locations and reported sensing data are taken into account. Sensing data are suppressed at all sensitive locations, while data at insensitive locations are reported with a certain probability defined by observing the corresponding CRF model. On the one hand, the privacy of the reported data is estimated by the difference $\delta$ between the probability that a user would be at a specific location given supplementary information and the same probability without the extra information. On the other hand, the utility of the method depends on the total amount of reported data (more is better). An estimation algorithm searches for the optimal strategy that maximizes utility while preserving a predefined privacy threshold. Although this approach allows users to define their desired privacy prerequisites, it cannot guarantee optimal protection.
\subsection{Sequential data}
% Anonymizing sequential releases
\hypertarget{wang2006anonymizing}{Wang and Fung}~\cite{wang2006anonymizing} address the problem of anonymously releasing different projections of the same data set, in subsequent timestamps. More precisely, the authors want to protect individual information that could be revealed by \emph{joining} various releases of the same data set. To do so, instead of locating the quasi-identifiers in a single release, the authors suggest that the identifiers may span the current and all previous releases of the (projections of the) data set. Then, the proposed method uses the join of the different releases on the common identifying attributes. The goal is to generalize the identifying attributes of the current release, given that previous releases are immutable. The generalization is performed in a top-down manner, meaning that the attributes are initially over-generalized and are then specialized step by step until the predefined quality and privacy requirements are met. The privacy requirement is the so-called \emph{$(X,Y)$-privacy} for a threshold $k$, meaning that the identifying attributes in $X$ are linked with at least $k$ sensitive values in $Y$, in the join of the previously released and current tables. The quality requirement can be plugged into the framework; three alternatives are proposed: the reduction of the class entropy~\cite{quinlan2014c4,shannon2001mathematical}, the notion of distortion, and the discernibility~\cite{bayardo2005data}. The authors propose an algorithm for the release of a table $T_1$ in the presence of a previous table $T_2$, which takes into account the scalability and performance problems that a join between the two may entail. Still, when many previous releases exist, the complexity would remain high.
% Privacy by diversity in sequential releases of databases
\hypertarget{Shmueli}{Shmueli and Tassa}~\cite{shmueli2015privacy} identified the computational inefficiency of anonymously releasing a data set while taking into account previous ones, in scenarios of sequential publication. In more detail, they consider the case where projections over different subsets of attributes of a table are published at subsequent times, and they provide an extension for attribute addition. Their algorithm can compute $l$-diverse anonymized releases (over different subsets of attributes) in parallel, by generating $l-1$ so-called \emph{fake} worlds. A fake world is generated from the base table by randomly permuting non-identifier and sensitive values among the tuples, in such a way that minimal information loss (quality desideratum) is incurred. This is made possible, in part, by verifying that the permutation is done among tuples with similar quasi-identifiers. Then, the algorithm creates buckets of tuples with at least $l$ different sensitive values, in which the quasi-identifiers are then generalized in order to achieve $l$-diversity (privacy protection desideratum). The generalization step is also conducted in an information-loss efficient way. All the different releases will be $l$-diverse, because they are created assuming the same possible worlds, with which they are consistent. Tuple and attribute deletion is briefly discussed and left as an open question. The paper is contrasted with a previous work~\cite{shmueli2012limiting} by the same authors, claiming that the new approach considers a stronger adversary (the adversary knows all individuals with their quasi-identifiers in the database, and not only one), and that the computation is much more efficient, as it does not have an exponential complexity w.r.t.\ the number of previous publications.
% Differentially private trajectory data publication
\hypertarget{chen2011differentially}{Chen et al.}~\cite{chen2011differentially} propose a non-interactive data-dependent sanitization algorithm to generate a differentially private release for trajectory data. First, a noisy \emph{prefix tree}, i.e.,~an ordered search tree data structure used to store an associative array, is constructed. Each node represents a possible location of a trajectory (a valid location from a set of locations that any user can be present in) and contains a perturbed count (the number of persons in the current location) with noise drawn from a Laplace distribution. The privacy budget is equally allocated to each level of the tree. At each level, and for every node, child nodes with a non-zero number of trajectories are identified as \emph{non-empty} by observing their noisy counts, so as to continue expanding them. All child nodes are associated with disjoint subsets of trajectories and thus, the parallel composition theorem of differential privacy can be applied. Therefore, the whole budget of a level can be used for each node at that level. An empty node is detected by injecting Laplace noise into its corresponding count and checking whether it is less than a preset threshold $\theta=\frac{2\sqrt{2}}{\varepsilon / h}$, where $\varepsilon$ is the available privacy budget and $h$ is the height of the tree. To generate the sanitized database, it is necessary to traverse the prefix tree once in post-order. At each node, the number of terminated trajectories is calculated and the corresponding copies of prefixes are sent to the output. During this process, some consistency constraints are taken into account to avoid erroneous trajectories due to the noise added previously. Namely, for any root-to-leaf path $p$, $\forall v_i \in p, |tr(v_i)| \leq |tr(v_{i+1})|$, where $v_i$ is a child of $v_{i+1}$, and for each node $v$, $|tr(v)| \geq \sum_{u \in children(v)} |tr(u)|$. Increasing the privacy budget results in a lower average relative error because less noise is added at each level. 
By increasing the height of the tree, the relative error initially decreases as more information is retained from the database. However, after a certain threshold, the increase of height can result in less available privacy budget at each level and thus more relative error due to the increased perturbation.
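The threshold $\theta$ above can be read as a multiple of the noise scale. As a sketch, assuming that each count has sensitivity $1$ (an assumption made here for illustration), a level with budget $\varepsilon / h$ uses Laplace noise with scale $b = h / \varepsilon$ and standard deviation $\sqrt{2}\,b$, so that
\[
\theta = \frac{2\sqrt{2}}{\varepsilon / h} = 2 \cdot \sqrt{2}\, b ,
\]
i.e., two standard deviations of the noise. For example, with $\varepsilon = 1$ and $h = 5$, the per-level budget is $0.2$, the scale is $b = 5$, and $\theta = 10\sqrt{2} \approx 14.14$. Doubling the height to $h = 10$ halves the per-level budget and doubles the threshold to $\theta = 20\sqrt{2} \approx 28.28$, which illustrates why taller trees eventually suffer from stronger perturbation.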
% Publishing trajectories with differential privacy guarantees
\hypertarget{jiang2013publishing}{Jiang et al.}~\cite{jiang2013publishing} focus on ship trajectories with known starting and terminal points. More specifically, they study several different noise addition mechanisms for publishing trajectories with differential privacy guarantees. These mechanisms include adding \emph{global} noise to the trajectory, adding noise to each location \emph{point} of the trajectory by sampling a noisy radius from an exponential distribution, and adding noise drawn from a Laplace distribution to each \emph{coordinate} of every location point. Comparing these techniques, the last one offers a better privacy guarantee and a smaller error bound, but the resulting trajectory is noticeably distorted, raising doubts about its practicality. A \emph{Sampling Distance and Direction (SDD)} mechanism is proposed to tackle the limited practicality that stems from adding Laplace noise to the trajectory coordinates. It enables the publishing of the optimal next possible trajectory point by sampling a suitable distance and direction at the current position, taking into account the ship's maximum speed constraint. The SDD mechanism outperforms the other mechanisms and can maintain good utility with very high probability, even while offering strong privacy guarantees.
% Anonymity for continuous data publishing
\hypertarget{fung2008anonymity}{Fung et al.}~\cite{fung2008anonymity} introduce the problem of privately releasing continuous \emph{incremental} data sets. The invariant of this kind of release is that at every timestamp $T_i$, the records previously released at a timestamp $T_j$, where $j<i$, are released again together with a set of new records. The authors first focus on two consecutive releases and describe three classes of possible attacks. They name these attacks \emph{correspondence} attacks because they rely on the principle that every tuple from data set $D_1$ corresponds to a tuple in the subsequent data set $D_2$. Naturally, the opposite does not hold, as tuples with a timestamp $T_2$ do not exist in $D_1$. Assuming that the attacker knows the quasi-identifiers and the timestamp of the record of a person, they define the \emph{backward}, \emph{cross} and \emph{forward} (\emph{BCF}) attacks. They show that combining two individually $k$-anonymized subsequent releases using one of the aforementioned attacks can lead to `cracking' some of the records in the set of $k$ candidate tuples, rendering the privacy level lower than $k$. Besides detecting cases where \emph{BCF} anonymity is compromised between two releases, the authors also provide an anonymization algorithm for a release $R_2$ in the presence of a private release $R_1$. The algorithm starts from the most generalized state for the quasi-identifiers of the records in $D_2$. Step by step, it checks which combinations of specializations on the attributes do not violate \emph{BCF} anonymity, and outputs the most specialized version of the data set. The authors discuss how the framework extends to multiple releases and to different kinds of privacy methods (other than $k$-anonymization). 
It is worth noting that in order to maintain a certain quality for a release, it is essential that the delta among subsequent releases is large enough; otherwise the needed generalization level may destroy the utility of the data set.
% Protecting Locations with Differential Privacy under Temporal Correlations
\hypertarget{xiao2015protecting}{Xiao et al.}~\cite{xiao2015protecting} propose another privacy definition based on differential privacy that accounts for temporal correlations in geo-tagged data. Location changes between two consecutive timestamps are determined by temporal correlations modeled through a Markov chain. A \emph{$\delta$-location} set includes all the probable locations where a user might appear, excluding locations of low probability. Therefore, the true location is hidden in the resulting set, in which any pair of locations is indistinguishable, and thus the user is protected. The lower the value of $\delta$, the more locations are included and hence, the higher the level of privacy achieved. The \emph{Planar Isotropic Mechanism (PIM)} is used as a perturbation mechanism to add noise to the released locations. The authors prove that the $l_1$-norm sensitivity fails to capture the exact sensitivity, i.e.,~the difference between any two query answers from two instances in neighboring databases, in a multidimensional space. For this reason, the \emph{sensitivity hull}, a notion that is independent of the context of location privacy, is utilized instead. In~\cite{xiao2017loclok} they demonstrate the functionality of their system \emph{LocLok}, which implements the concept of $\delta$-location. In spite of taking into account temporal correlations for identifying the next possible locations of a user, the proposed definition does not evaluate the corresponding privacy leakage.
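As a sketch of how $\delta$ controls the size of the set, assume a formalization in which the $\delta$-location set $\Delta X_t$ is a smallest set of locations whose prior probabilities sum to at least $1 - \delta$ (the notation is chosen here for illustration and should be checked against~\cite{xiao2015protecting}):
\[
\Delta X_t = \arg\min_{L \,:\, \sum_{l \in L} \Pr[X_t = l] \,\geq\, 1 - \delta} |L| .
\]
For a hypothetical prior $(0.5, 0.3, 0.15, 0.05)$ over four locations, $\delta = 0.25$ yields the two most probable locations ($0.5 + 0.3 = 0.8 \geq 0.75$), whereas $\delta = 0.1$ yields the three most probable ones ($0.5 + 0.3 + 0.15 = 0.95 \geq 0.9$), in line with the observation that a lower $\delta$ includes more locations.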
% An adaptive geo-indistinguishability mechanism for continuous LBS queries
\hypertarget{al2018adaptive}{Al-Dhubhani et al.}~\cite{al2018adaptive} propose an adaptive privacy-preserving technique that deals with correlation analysis attacks by adjusting the amount of noise required to obfuscate a user's location based on its correlation level with the previously released (obfuscated) locations. Their technique is based on \emph{geo-indistinguishability}~\cite{andres2013geo}, an adaptation of differential privacy for location data, which adds controlled random noise, drawn from a bivariate Laplace distribution (\emph{Planar Laplace}), to users' locations. The system architecture considered involves only the users and the queried service providers, excluding any third-party entities. The adversary's ability to estimate a user's position is first evaluated by means of a regression algorithm that exploits previous location releases over a certain prediction window, and noise is then added accordingly. That is, in areas where locations are strongly correlated, and an adversary can therefore predict the current value with a lower estimation error, more noise is added to the released locations. The opposite holds for locations with weaker correlations. Adapting the amount of injected noise to the data correlation level might lead to a better performance, in terms of both privacy and utility, in the short term. However, altering the amount of injected noise at each timestamp without taking into account the previously released data can lead to arbitrary privacy and utility loss in the long term. Applying a filtering algorithm on the perturbed data points, prior to their release, can effectively deal with any possible data discrepancy.
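For reference, the guarantee and the noise distribution that this technique builds on can be written as follows; the notation ($K$ for the mechanism, $d$ for the Euclidean distance, $x_0$ for the true location) is chosen here for illustration. A mechanism $K$ satisfies $\varepsilon$-geo-indistinguishability~\cite{andres2013geo} if, for all locations $x, x'$ and every set $Z$ of reported locations,
\[
\Pr[K(x) \in Z] \leq e^{\varepsilon\, d(x, x')} \Pr[K(x') \in Z] ,
\]
and the Planar Laplace mechanism reports a point $z$ drawn from the density
\[
D_{\varepsilon}(x_0)(z) = \frac{\varepsilon^2}{2\pi}\, e^{-\varepsilon\, d(x_0, z)} .
\]
Under this density the expected distance between $x_0$ and $z$ is $2/\varepsilon$, so injecting more noise in strongly correlated areas can be thought of as using a smaller $\varepsilon$ there.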
% Preventing velocity-based linkage attacks in location-aware applications
\hypertarget{ghinita2009preventing}{Ghinita et al.}~\cite{ghinita2009preventing} tackle attacks on location privacy that arise from linking the maximum user velocity with cloaked regions, due to adversarial background knowledge, when using Location-Based Services. The proposed methods prevent the disclosure of the exact user location coordinates and bound the association probability to a certain user-defined threshold related to user-sensitive features, e.g.,~religious beliefs, health condition, etc., linked to corresponding locations, e.g.,~church, hospital, etc. The first method, referred to as \emph{temporal cloaking}, is achieved via either \emph{deferral} or \emph{postdating}. The former is applied by delaying the disclosure of a cloaked region that is `too far' from the previously reported region, i.e.,~impossible to have been reached based on the known maximum user speed. The latter requires reporting the nearest previous cloaked region; since it is near the actual region, the corresponding results are highly likely to be relevant. A request is usually postdated when the user-specified threshold is exceeded; otherwise, the nearest candidate region is selected and is deferred or postdated depending on the outcome of the comparison. The second method, \emph{spatial cloaking}, creates cloaked regions by first taking into account all the user-specified features relevant to the specific location (\emph{filtering of features}) and then enlarging the area of the region to satisfy the privacy requirements (\emph{cloaking}). Finally, the region is deferred until it includes the current timestamp (\emph{safety enforcement}), similarly to temporal cloaking. The final QoS, as affected by the privacy protection offered by the present methods, is measured in terms of the \emph{cloaked region size}, \emph{time and space error}, and \emph{failure ratio}. 
The cloaked region size is taken into consideration since larger regions may decrease the usability of the retrieved information. Time and space errors are possible due to delayed location reporting and to cloaked regions that are built around past locations and do not include the current one. Finally, the failure ratio is calculated by measuring the requests dropped in cases where the specified privacy requirements cannot be satisfied. Considering the cloak granularity as the only privacy metric proves inadequate, since it can be easily compromised in cases of low user presence around the sensitive area.
\subsection{Time series}
% Time distortion anonymization for the publication of mobility data with high utility
\hypertarget{primault2015time}{Primault et al.}~\cite{primault2015time} proposed \emph{Promesse}, an algorithm that builds on time distortion instead of location distortion, to ensure \emph{user-level} privacy when releasing trajectories. \emph{Promesse} takes as input a user's mobility trace, comprising a set of pairs of geolocations and timestamps, and a parameter \emph{$\varepsilon$}, i.e.,~the privacy budget. Initially, regularly spaced locations are extracted and each one of them is interpolated at a distance depending on the previous location and the value of $\varepsilon$. Then, the first and last locations of the mobility trace are removed and uniformly distributed timestamps are assigned to the remaining locations of the trajectory. In this way, the resulting trace has a smooth speed and therefore \emph{points of interest (POIs)}, i.e.,~places where the user spent more time, e.g.,~home, work, etc., are indistinguishable to adversaries. The present algorithm works better with fine-grained data sets, because in this way it can achieve optimal geolocation and timestamp pairing. Furthermore, it can only be used offline, rendering it unsuitable for most real-life application scenarios.