From 4957a02e80030ec447756bfddcd5006e17a52bf0 Mon Sep 17 00:00:00 2001
From: Manos Katsomallos
Date: Thu, 8 Jul 2021 12:59:12 +0200
Subject: [PATCH] Cleaning up

---
 discussion.tex  |  37 ----------------
 microdata.tex   |  90 --------------------------------------
 statistical.tex | 113 ------------------------------------------------
 3 files changed, 240 deletions(-)
 delete mode 100644 discussion.tex
 delete mode 100644 microdata.tex
 delete mode 100644 statistical.tex

diff --git a/discussion.tex b/discussion.tex
deleted file mode 100644
index ca1121a..0000000
--- a/discussion.tex
+++ /dev/null
@@ -1,37 +0,0 @@
-\subsection{Discussion}
-\label{subsec:discussion}
-
-In the previous sections, we provided a brief summary and review of each work on privacy-preserving publication under continual data schemes that falls into the Microdata and Statistical Data categories.
-The main elements summarized in Table~\ref{tab:related} allow us to make some interesting observations, both on each category individually and more generally.
-
-In the Statistical Data category, all of the works deal with data linkage attacks, while some more recent works take possible data correlations into consideration as well.
-We notice that data linkage is currently treated in the literature as the worst-case scenario.
-For this reason, works in the Statistical Data category provide a robust privacy protection solution independent of the adversaries' knowledge.
-The prevailing distortion method in this category is probabilistic perturbation.
-This is justified by the fact that nearly all of the observed methods are based on differential privacy.
-The majority implement the Laplace mechanism, while some offer an adaptive approach.
-
-In the Microdata category, we observe that problems with sequential data, i.e.,~data that are generated in a sequence and depend on the values in previous data sets, are more prominent.
-It is important to note that works on this set of problems follow similar scenarios, i.e.,~publishing updated versions of an original data set, either vertically (schema-wise) or horizontally (tuple-wise).
-Naturally, in such cases the most evident attack scenarios are the complementary release ones, as in each release there is a high probability of an intersection of tuples with previous releases.
-On the other hand, when the problem involves data streams or stream processing, we observe that the data are location specific, most commonly trajectories.
-In such cases, the attacks considered are broader (going beyond other versions of an original data set), taking into account external information, e.g.,~correlations that are typically available for location-specific data.
-
-Regarding correlations, in either category the protection method used is mainly probabilistic, if not total suppression.
-This makes sense, since generalization would not cancel the correlation between attributes.
-Generalization is naturally used in group-based techniques, to make it possible to group more tuples under the generated categories --- and thus achieve anonymization.
-
-As far as the protection levels are concerned, the Microdata category mainly targets event-level protection, as all users are protected equally through the performed grouping.
-Still, scenarios that contain trajectories associated with a certain user aim to protect this user's privacy by blurring the actual trajectories (user-level protection).
-$w$-event-level protection is absent in the Microdata category; one reason may be that streaming scenarios are not prominent in this category, and another, more practical, reason may be that this notion was introduced later in time.
-Indeed, none of the works in the Microdata category explicitly mentions the privacy level, as these levels were introduced in differential privacy scenarios, hence in Statistical Data.
-Considering all the use cases from both categories, event-level protection is more prominent, as in continual settings it is more practical to protect all the users as a single set than each one individually.
-
-As already discussed, problems with stream processing are not common in the Microdata category.
-Indeed, most of the cases involving streaming scenarios are in the Statistical Data category.
-A technical reason behind this observation is that anonymizing a raw data set as a whole may be a time-consuming process, and thus not well suited for streaming.
-The complexity actually depends on the number of attributes, if we consider the possible combinations that may be enumerated.
-On the contrary, aggregation functions as used in the Statistical Data category, especially in the absence of filters, are usually low cost.
-Moreover, perturbing a numerical value (the usual result of an aggregation function) does not add much to the complexity of the algorithm (depending, of course, on the perturbation model used).
-For this reason, perturbing the result of a process is more time efficient than anonymizing the data set and then running the process on the anonymized data.
-Still, we may argue that an anonymized data set can be more widely used; in the case of statistical data, it is only the data holder that performs the processes and releases the results.
diff --git a/microdata.tex b/microdata.tex
deleted file mode 100644
index 2a8b8c1..0000000
--- a/microdata.tex
+++ /dev/null
@@ -1,90 +0,0 @@
-\section{Microdata}
-\label{sec:microdata}
-
-As observed in Table~\ref{tab:related}, privacy-preserving algorithms for microdata rely on $k$-anonymity or derivatives of it. Ganta et al.~\cite{ganta2008composition} revealed that $k$-anonymity methods are vulnerable to \emph{composition attacks}. Consequently, these attacks drew the attention of researchers, who proposed various algorithms based on $k$-anonymity, each introducing a different dimension to the problem, for instance that previous releases are known to the publisher, or that the quasi-identifiers can be formed by combining attributes in different releases. Note, however, that only one (Li et al.~\cite{li2016hybrid}) of the following works assumes \emph{independently} anonymized data sets that may not be known to the publisher in the attack model, making it more general than the rest of the works.
-
-
-% \subsection{Continual data}
-
-% \mk{Nothing to put here.}
-
-
-\subsection{Data streams}
-
-% M-invariance: towards privacy preserving re-publication of dynamic data sets
-
-\hypertarget{xiao2007m}{Xiao et al.}~\cite{xiao2007m} consider the case when a data set is (re)published at different time snapshots in an update (tuple delete, insert) manner. More precisely, they address anonymization in dynamic environments by implementing $m$-\emph{invariance}. In a simple $k$-anonymization (or $l$-diversity) scenario, the privacy of an individual that exists in two updates can be compromised by the intersection of the sets of sensitive values. In contrast, an individual who exists in a series of $m$-invariant releases is always associated with the same set of $m$ different sensitive values. To enable the publishing of $m$-invariant data sets, artificial tuples called \emph{counterfeits} may be added to a release. To minimize the noise added to the data sets, the authors provide an algorithm with two extra desiderata: minimizing the counterfeits and the quasi-identifiers' generalization level. Still, the choice of adding tuples with specific sensitive values disturbs the value distribution, with a direct effect on any relevant statistical analysis.
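To make the invariant concrete, here is a minimal sketch (not the authors' algorithm; the record layout, the function names, and the omission of counterfeit generation and group-size checks are assumptions for illustration) that verifies $m$-invariance over a sequence of releases:

\begin{verbatim}
from collections import defaultdict

# A release is modeled as a list of records (person_id, group_id, sensitive_value);
# counterfeit tuples would simply appear as extra records with fresh person_ids.

def group_signatures(release):
    """Map every group to the set of sensitive values it contains."""
    sig = defaultdict(set)
    for _, gid, sens in release:
        sig[gid].add(sens)
    return {gid: frozenset(vals) for gid, vals in sig.items()}

def is_m_invariant(releases, m):
    """Check the m-invariance conditions over a sequence of releases."""
    history = {}  # person_id -> signature observed in earlier releases
    for release in releases:
        sig = group_signatures(release)
        # every group must contain exactly m distinct sensitive values
        if any(len(s) != m for s in sig.values()):
            return False
        # a re-published individual must keep the same signature
        for pid, gid, _ in release:
            if history.setdefault(pid, sig[gid]) != sig[gid]:
                return False
    return True
\end{verbatim}

Under these assumptions, the check fails as soon as a re-published individual lands in a group whose set of sensitive values differs from the one it was published with before.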
-
-
-% Preventing equivalence attacks in updated, anonymized data
-
-In the same update setting (insert/delete), \hypertarget{he2011preventing}{He et al.}~\cite{he2011preventing} introduce another kind of attack, namely the \emph{equivalence} attack, not taken into account by the aforementioned $m$-invariance technique. The equivalence attack allows for sets of individuals (of size $e 0$), for all possible outputs $\overrightarrow{o}$ ($\Pr[A(\overrightarrow{x}) = \overrightarrow{o}] > 0$), for all times $t$ and all sensitive contexts $s\in S$, it satisfies the condition $\Pr[X_t = s|\overrightarrow{o}] - \Pr[X_t = s] \leq \delta$. After filtering all the elements of a given stream, an output sequence for a single day is released. The process can be repeated to publish longer context streams. The utility of the system is measured as the expected number of released contexts. Letting the user define the privacy settings requires that the user have a certain level of relevant knowledge, which is not usually the case in real life. Additionally, suppressing data can sometimes disclose more information than releasing them, e.g.,~releasing multiple data points around a `sensitive' area (and never inside it) will eventually disclose the protected area.
-
-
-% PLP: Protecting location privacy against correlation analyze Attack in crowdsensing
-
-\hypertarget{ma2017plp}{Ma et al.}~\cite{ma2017plp} propose \emph{PLP}, a crowdsensing scheme that protects location privacy against adversaries that can extract spatiotemporal correlations---modeled with CRFs---from crowdsensing data. The users' context (location, sensing data) stream is filtered, while long-range dependencies among locations and reported sensing data are taken into account. Sensing data are suppressed at all sensitive locations, while data at non-sensitive locations are reported with a certain probability defined by observing the corresponding CRF model. On the one hand, the privacy of the reported data is estimated by the difference $\delta$ between the probability that a user is at a specific location given supplementary information and the same probability without the extra information. On the other hand, the utility of the method depends on the total amount of reported data (more is better). An estimation algorithm searches for the optimal strategy that maximizes utility while preserving a predefined privacy threshold. Although this approach allows users to define their desired privacy prerequisites, it cannot guarantee optimal protection.
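The filtering idea shared by these suppression-based approaches can be sketched as follows; this is only an illustration, not PLP itself: the posterior computation (which in the paper comes from CRF inference), the function names, and the data layout are assumptions. A context is withheld whenever releasing it would raise the adversary's belief in some sensitive context by more than $\delta$:

\begin{verbatim}
def breaches_delta(prior, posterior, sensitive, delta):
    """True if some sensitive context becomes more than delta likelier
    once the candidate output is observed."""
    return any(posterior[s] - prior[s] > delta for s in sensitive)

def filter_stream(contexts, prior, posterior_given_release, sensitive, delta):
    """Release or suppress each context of a stream; posterior_given_release
    stands in for model-based inference (e.g., over a Markov or CRF model)."""
    released = []
    for t, c in enumerate(contexts):
        posterior = posterior_given_release(t, c)
        if breaches_delta(prior, posterior, sensitive, delta):
            released.append(None)   # suppressed position
        else:
            released.append(c)
    return released
\end{verbatim}

In this simplified view, the utility of the filter is simply the number of positions that end up being released.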
-
-
-\subsection{Sequential data}
-
-% Anonymizing sequential releases
-
-\hypertarget{wang2006anonymizing}{Wang and Fung}~\cite{wang2006anonymizing} address the problem of anonymously releasing different projections of the same data set in subsequent timestamps. More precisely, the authors want to protect individual information that could be revealed by \emph{joining} various releases of the same data set. To do so, instead of locating the quasi-identifiers in a single release, the authors suggest that the identifiers may span the current and all previous releases of the (projections of the) data set. Then, the proposed method uses the join of the different releases on the common identifying attributes. The goal is to generalize the identifying attributes of the current release, given that previous releases are immutable. The generalization is performed in a top-down manner, meaning that the attributes are initially over-generalized and are specialized step by step until predefined quality and privacy requirements are met. The privacy requirement is the so-called $(X,Y)$-privacy for a threshold $k$, meaning that the identifying attributes in $X$ are linked with at most $k$ sensitive values in $Y$ in the join of the previously released and current tables. The quality requirement can be tuned in the framework, and three alternatives are proposed: the reduction of the class entropy~\cite{quinlan2014c4,shannon2001mathematical}, the notion of distortion, and the discernibility~\cite{bayardo2005data}. The authors propose an algorithm for the release of a table $T1$ in the presence of a previous table $T2$, which takes into account the scalability and performance problems that a join between the two may entail. Still, when many previous releases exist, the complexity would remain high.
-
-
-% Privacy by diversity in sequential releases of databases
-
-\hypertarget{Shmueli}{Shmueli and Tassa}~\cite{shmueli2015privacy} identified the computational inefficiency of anonymously releasing a data set while taking into account previous ones, in scenarios of sequential publication. In more detail, they consider the case where, at subsequent times, projections over different subsets of attributes of a table are published, and they provide an extension for attribute addition. Their algorithm can compute $l$-diverse anonymized releases (over different subsets of attributes) in parallel, by generating $l-1$ so-called \emph{fake} worlds. A fake world is generated from the base table by randomly permuting non-identifier and sensitive values among the tuples, in such a way that minimal information loss (quality desideratum) is incurred. This is achieved partially by verifying that the permutation is done among tuples with similar quasi-identifiers. Then, the algorithm creates buckets of tuples with at least $l$ different sensitive values, in which the quasi-identifiers are then generalized in order to achieve $l$-diversity (privacy protection desideratum). The generalization step is also conducted in an information-loss-efficient way. All the different releases will be $l$-diverse, because they are created assuming the same possible worlds, with which they are consistent. Tuple/attribute deletion is briefly discussed and left as an open question. The paper is contrasted with a previous work~\cite{shmueli2012limiting} of the same authors, claiming that the new approach considers a stronger adversary (the adversary knows all individuals with their quasi-identifiers in the database, and not only one) and that the computation is much more efficient, as it does not have exponential complexity w.r.t. the number of previous publications.
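The bucketing condition at the heart of this construction can be illustrated with a short greedy sketch; it is a stand-in rather than the fake-worlds algorithm, and the record layout, names, and the omission of the quasi-identifier-similarity and generalization steps are simplifications:

\begin{verbatim}
from collections import defaultdict

def l_diverse_buckets(records, l):
    """Greedily form buckets containing at least l distinct sensitive values.
    records: list of (quasi_identifier, sensitive_value) pairs."""
    by_value = defaultdict(list)
    for qid, sens in records:
        by_value[sens].append((qid, sens))
    buckets = []
    # keep going while at least l sensitive values still have unassigned tuples
    while sum(1 for recs in by_value.values() if recs) >= l:
        top = sorted(by_value, key=lambda s: len(by_value[s]), reverse=True)[:l]
        buckets.append([by_value[s].pop() for s in top])
    leftovers = [r for recs in by_value.values() for r in recs]
    return buckets, leftovers
\end{verbatim}

A real implementation would additionally group tuples with close quasi-identifiers, so that the subsequent generalization incurs little information loss.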
-
-
-% Differentially private trajectory data publication
-
-\hypertarget{chen2011differentially}{Chen et al.}~\cite{chen2011differentially} propose a non-interactive data-dependent sanitization algorithm to generate a differentially private release of trajectory data. First, a noisy \emph{prefix tree}, i.e.,~an ordered search tree data structure used to store an associative array, is constructed. Each node represents a possible location of a trajectory---a legitimate location from the set of locations where any user can be present---and contains a perturbed count---the number of persons at the current location---with noise drawn from a Laplace distribution. The privacy budget is allocated equally to each level of the tree. At each level, and for every node, children nodes with a non-zero number of trajectories are identified as \emph{non-empty} by observing noisy counts, so as to continue expanding them. All children nodes are associated with disjoint subsets and thus the parallel composition theorem of differential privacy can be applied. Therefore, all the available budget can be used for each node. An empty node is detected by injecting Laplace noise into its corresponding count and checking if it is less than a preset threshold $\theta=\frac{2\sqrt{2}}{\varepsilon / h}$, where $\varepsilon$ is the available privacy budget and $h$ the height of the tree. To generate the sanitized database, it is necessary to traverse the prefix tree once in post-order. At each node, the number of terminated trajectories is calculated, and the corresponding copies of prefixes are sent to the output. During this process, some consistency constraints are taken into account to avoid erroneous trajectories due to the previously added noise. Namely, for any root-to-leaf path $p$ and $\forall v_i \in p$, $|tr(v_i)| \leq |tr(v_{i+1})|$, where $v_i$ is a child of $v_{i+1}$; and for each node $v$, $|tr(v)| \geq \sum_{u \in children(v)} |tr(u)|$. Increasing the privacy budget results in a lower average relative error, because less noise is added at each level. Increasing the height of the tree initially decreases the relative error, as more information is retained from the database. However, after a certain threshold, a further increase in height results in less available privacy budget at each level and thus a higher relative error due to the increased perturbation.
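A minimal sketch of the top-down construction described above is given below; it assumes the per-level budget split and the threshold $\theta$ from the text, omits the post-order consistency step and trajectory reconstruction, and uses illustrative names throughout:

\begin{verbatim}
import numpy as np

def noisy_prefix_tree(trajectories, locations, epsilon, h):
    """Build a noisy prefix tree of height h over location sequences."""
    scale = 1.0 / (epsilon / h)               # Laplace scale per level
    theta = 2 * np.sqrt(2) / (epsilon / h)    # emptiness threshold
    root = {"prefix": (), "children": [],
            "count": len(trajectories) + np.random.laplace(0, scale)}
    frontier = [(root, trajectories)]
    for _ in range(h):
        next_frontier = []
        for node, trajs in frontier:
            depth = len(node["prefix"])
            for loc in locations:             # disjoint subsets: parallel composition
                sub = [t for t in trajs if len(t) > depth and t[depth] == loc]
                noisy = len(sub) + np.random.laplace(0, scale)
                if noisy >= theta:            # treat as non-empty and keep expanding
                    child = {"prefix": node["prefix"] + (loc,),
                             "children": [], "count": noisy}
                    node["children"].append(child)
                    next_frontier.append((child, sub))
        frontier = next_frontier
    return root
\end{verbatim}

Sanitized trajectories would then be emitted by a post-order traversal that also enforces the consistency constraints mentioned above.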
-
-
-% Publishing trajectories with differential privacy guarantees
-
-\hypertarget{jiang2013publishing}{Jiang et al.}~\cite{jiang2013publishing} focus on ship trajectories with known starting and terminal points. More specifically, they study several noise addition mechanisms for publishing trajectories with differential privacy guarantees. These mechanisms include adding \emph{global} noise to the trajectory, adding noise to each location \emph{point} of the trajectory by sampling a noisy radius from an exponential distribution, and adding noise drawn from a Laplace distribution to each \emph{coordinate} of every location point. Comparing these techniques, the latter offers a better privacy guarantee and a smaller error bound, but the resulting trajectory is noticeably distorted, raising doubts about its practicality. A \emph{Sampling Distance and Direction (SDD)} mechanism is proposed to tackle the limited practicality stemming from the addition of Laplace noise to the trajectory coordinates. It enables the publishing of the optimal next possible trajectory point by sampling a suitable distance and direction at the current position, taking into account the ship's maximum speed constraint. The SDD mechanism outperforms the other mechanisms and can maintain good utility with very high probability, even while offering strong privacy guarantees.
-
-
-% Anonymity for continuous data publishing
-
-\hypertarget{fung2008anonymity}{Fung et al.}~\cite{fung2008anonymity} introduce the problem of privately releasing continuous \emph{incremental} data sets. The invariant of this kind of release is that at every timestamp $T_i$, the records previously released at a timestamp $T_j$, where $j