Merge branch 'master' of git.delkappa.com:manos/the-last-thing

Manos Katsomallos 2021-09-17 20:27:22 +03:00
commit a1bc5478bb
2 changed files with 60 additions and 39 deletions

@ -1,10 +1,12 @@
\section{Data correlation}
\label{sec:correlation}
\kat{Please add some introduction to each section, presenting what you will discuss afterwards, and link it somehow to what was already discussed.}
\subsection{Types of correlation}
\label{subsec:cor-types}
The most prominent types of correlation might be:
The most prominent types of correlation are:
\begin{itemize}
\item \emph{Temporal}~\cite{wei2006time}---appearing in observations (i.e.,~values) of the same object over time.
@ -15,7 +17,7 @@ The most prominent types of correlation might be:
Contrary to one-dimensional correlation, spatial correlation is multi-dimensional and multi-directional, and can be measured by indicators (e.g.,~\emph{Moran's I}~\cite{moran1950notes}) that reflect the \emph{spatial association} of the concerned data.
Spatial autocorrelation has its foundations in the \emph{First Law of Geography} stating that ``everything is related to everything else, but near things are more related than distant things''~\cite{tobler1970computer}.
A positive spatial autocorrelation indicates that similar data are \emph{clustered}, a negative that data are dispersed and are close to dissimilar ones, and when close to zero, that data are \emph{randomly arranged} in space.
\kat{I still do not like this focus on spatial correlation.. maybe remove it totally? we only consider temporal correlation in the main work in any case.}
\subsection{Extraction of correlation}
\label{subsec:cor-ext}
@ -30,7 +32,7 @@ Some common stochastic processes modeling techniques include:
\begin{itemize}
\item \emph{Conditional probabilities}~\cite{allan2013probability}---probabilities of events in the presence of other events.
\item \emph{Conditional Random Fields} (CRFs)~\cite{lafferty2001conditional}---undirected graphs encoding conditional probability distributions.
\item \emph{Markov processes}~\cite{rogers2000diffusions}---stochastic processes for which the conditional probability of their future states depends only on the present state and it is independent of its previous states (\emph{Markov assumption}).
\item \emph{Markov processes}~\cite{rogers2000diffusions}---stochastic processes for which the conditional probability of their future states depends only on the present state and is independent of their previous states (\emph{Markov assumption}). We highlight the following two sub-categories:
\begin{itemize}
\item \emph{Markov chains}~\cite{gagniuc2017markov}---sequences of possible events whose probability depends on the state attained in the previous event.
\item \emph{Hidden Markov Models} (HMMs)~\cite{baum1966statistical}---statistical Markov models of Markov processes with unobserved states.
@ -45,7 +47,7 @@ Correlation appears in dependent data:
\begin{itemize}
\item within one data set, and
\item within one data set and among one data set and previous data releases, and/or other external sources~\cite{kifer2011no, chen2014correlated, liu2016dependence, zhao2017dependent}.
\item among one data set and previous data releases, and/or other external sources~\cite{kifer2011no, chen2014correlated, liu2016dependence, zhao2017dependent}.
\end{itemize}
In the former case, data tuples and data values within a data set may be correlated, or linked in such a way that information about one person can be inferred even if the person is absent from the database.

@ -94,6 +94,7 @@ Moreover, in user-level (Figure~\ref{fig:level-user}) it is hard to determine wh
Finally, in $2$-event-level (Figure~\ref{fig:level-w-event}) it is hard to determine whether Quackmore was ever included in the released series of events between the timestamps $t_1$ and $t_2$, $t_2$ and $t_3$, etc. (i.e.,~for a window $w = 2$).
\kat{Already, by looking at the original counts, for the reader it is hard to see if Quackmore was in the event/database. So, we don't really get the difference among the different levels here.}
\mk{It is without background knowledge.}
\kat{But you discuss event and level here by showing just counts, with no background knowledge, and you want the reader to understand how in one case we are not sure if he participated in the event t1 or in any of the events. It is not clear to me what is the difference, just by looking at the example with the counts. }
\begin{figure}[htp]
\centering
@ -114,7 +115,7 @@ Finally, in $2$-event-level (Figure~\ref{fig:level-w-event}) it is hard to deter
\end{figure}
Contrary to event-level, which provides privacy guarantees for a single event, user- and $w$-event-level offer stronger privacy protection by protecting a series of events.
Event- and $w$-event-level handle better scenarios of infinite data observation, whereas user-level is more appropriate when the span of data observation is finite.
Event- and $w$-event-level better fit scenarios of infinite data observation, whereas user-level is more appropriate when the span of data observation is finite.
$w$-event- is narrower than user-level protection due to its sliding window processing methodology.
In the extreme cases where $w$ is equal either to $1$ or to the length of the time series, $w$-event- matches event- or user-level protection, respectively.
Although the described levels have been coined in the context of \emph{differential privacy}~\cite{dwork2006calibrating}, a seminal privacy method that we will discuss in more detail in Section~\ref{subsec:prv-statistical}, they are used for other privacy protection techniques as well.
@ -123,17 +124,18 @@ Although the described levels have been coined in the context of \emph{different
\subsection{Privacy-preserving operations}
\label{subsec:prv-operations}
Protecting private information
%Protecting private information
% , which is known by many names (obfuscation, cloaking, anonymization, etc.),
% \kat{the techniques are not equivalent, so it is correct to say that they are different names for the same thing}
is achieved by using a specific basic
%is achieved by using a specific basic
% \kat{but later you mention several ones.. so what is the specific basic one ?}
privacy protection operation.
Depending on the
technique
%privacy protection operation.
%Depending on the
%technique
% intervention
% \kat{?, technique, algorithm, method, operation, intervention.. we are a little lost with the terminology and the difference among all these }
that we choose to perform on the original data, we identify the following operations:
%that we choose to perform on the original data,
We identify the following privacy operations that can be applied to the original data to achieve privacy preservation:
% \kat{you can mention that the different operations have different granularity}
% \mk{``granularity''?}
@ -153,11 +155,11 @@ that we choose to perform on the original data, we identify the following operat
\end{itemize}
For example, consider the table schema \emph{User(Name, Age, Location, Status)}.
If we want to protect the \emph{Age} of the user by aggregation, we may replace it by the average age in her Location; by generalization, we may replace the Age by age intervals; by suppression we may delete the entire table column corresponding to \emph{Age}; by perturbation, we may augment each age by a predefined percentage of the age; by randomization we may randomly replace each age by a value taken from the probability density function of the attribute.
If we want to protect the \emph{Age} of the user by aggregation, we may replace it by the average age in her Location\kat{This example does not follow the description you give before for aggregation. Indeed, it fits better the perturbation (you replaced the value with the average age of the same location, which is a deterministic process). Don't you mean counts by aggregation? If you mean aggregation as in sql functions then you should not say in the definition that you replace the rows with the aggregate, but a specific attribute's value. }; by generalization, we may replace the Age by age intervals; by suppression we may delete the entire table column corresponding to \emph{Age}; by perturbation, we may augment each age by a predefined percentage of the age; by randomization we may randomly replace each age by a value taken from the probability density function of the attribute.
It is worth mentioning that there is a series of algorithms (e.g.,~\cite{benaloh2009patient, kamara2010cryptographic, cao2014privacy}) based on the \emph{cryptography} operation.
However, the majority of these methods, among other assumptions that they make, place minimal or even no trust in the entities that handle the personal information.
Furthermore, the amount and the way of data processing of these techniques usually burden the overall procedure, deteriorate the utility of the resulting data sets to a point where they are completely useless, and restrict their applicability.
Furthermore, the amount and the manner of data processing in these techniques usually burden the overall procedure, deteriorate the utility of the resulting data sets to a point where they are completely useless, and thus restrict their usage by third parties.
% \kat{All these points apply also to the non-cryptography techniques. So you should mostly point out that they do not only deteriorate the utility but make them non-usable at all.}
Our focus is limited to techniques that achieve a satisfactory balance between participants' privacy and data utility.
% For these reasons, there will be no further discussion around this family of techniques in this article.
@ -170,15 +172,18 @@ Our focus is limited to techniques that achieve a satisfying balance between bot
For completeness, in this section we present the seminal works for privacy-preserving data publishing, which, even though originally designed for the snapshot publishing scenario,
% \kat{was dp designed for the snapshot publishing scenario?}
% \mk{Not clearly but yes. We can write it since DP was coined in 2006, while DP under continual observation came later in 2010.}
have paved the way, since many of the works in privacy-preserving continuous publishing are based on or extend them.
have paved the way
%, since many
for privacy-preserving continuous publishing as well.
% are based on or extend them.
\subsubsection{Microdata}
\label{subsec:prv-micro}
Sweeney coined \emph{$k$-anonymity}~\cite{sweeney2002k}, one of the first established works on data privacy.
A released data set features $k$-anonymity protection when the sequence of values for a set of identifying attributes, called the \emph{quasi-identifiers}, is the same for at least $k$ records in the data set.
Computing the quasi-identifiers in a set of attributes is still a hard problem on its own~\cite{motwani2007efficient}.
A released data set features $k$-anonymity protection when the values of a set of identifying attributes, called the \emph{quasi-identifiers}, are the same for at least $k$ records in the data set.
Computing the quasi-identifiers in a set of attributes is still a hard problem on its own~\cite{motwani2007efficient}.\kat{yes indeed, but seems out of context here.}
% $k$-anonymity
% is syntactic,
% \kat{meaning?}
@ -207,24 +212,29 @@ Proposed solutions include rearranging the attributes, setting the whole attribu
\label{subsec:prv-statistical}
While methods based on $k$-anonymity have been mainly employed for releasing microdata, \emph{differential privacy}~\cite{dwork2006calibrating} has been proposed for releasing high utility aggregates over microdata while providing semantic
% \kat{semantic ?}
\kat{semantic ?}
privacy guarantees that characterize the output data.
Differential privacy is algorithmic,
% \kat{algorithmic? moreover, you repeat this sentence later on, after the definition of neighboring datasets}
it characterizes the data publishing process which passes its privacy guarantee to the resulting data.
It ensures that any adversary observing a privacy-protected output, no matter their computational power or auxiliary information, cannot conclude with absolute certainty if an individual is included in the input data set.
Moreover, it quantifies and bounds the impact that the addition/removal of an individual to/from a data set has on the derived privacy-protected aggregates thereof.
it characterizes the data publishing process, which passes its privacy guarantee to the resulting data.
It ensures that any adversary observing a privacy-protected output, no matter their computational power or auxiliary information, cannot conclude with absolute certainty if an individual is included in the input data set (Definition~\ref{def:nb-d-s}).
\begin{definition}
[Neighboring data sets]
\label{def:nb-d-s}
Two data sets are neighboring (or adjacent) when they differ by at most one tuple, i.e.,~one can be obtained by adding/removing the data of an individual to/from the other.
\end{definition}
Moreover, differential privacy quantifies and bounds the impact that the addition/removal of an individual to/from a data set has on the derived privacy-protected aggregates thereof.
More precisely, differential privacy quantifies the impact of the addition/removal of a single tuple in $D$ on the output $\pmb{o}$ of a privacy mechanism $\mathcal{M}$.
% \kat{what is M?}
The distribution of all $\pmb{o}$, in some range $\mathcal{O}$, is not affected \emph{substantially}, i.e.,~it changes only slightly due to the modification of any one tuple in all possible $D \in \mathcal{D}$.
Formally, differential privacy is given in Definition~\ref{def:dp}.
% \kat{introduce the following definition, and link it to the text before. Maybe you can put the definition after the following paragraph.}
\begin{definition}
[Neighboring data sets]
\label{def:nb-d-s}
Two data sets are neighboring (or adjacent) when they differ by at most one tuple, i.e.,~one can be obtained by adding/removing the data of an individual to/from the other.
\end{definition}
% Thus, differential privacy
% is algorithmic,
@ -233,7 +243,7 @@ The distribution of all $\pmb{o}$, in some range $\mathcal{O}$, is not affected
% ensures that any adversary observing any $\pmb{o}$ cannot conclude with absolute certainty whether or not any individual is included in any $D$.
% Its performance is irrelevant to the computational power and auxiliary information available to an adversary observing the outputs of $\mathcal{M}$.
% \kat{you already said this. Moreover, it is irrelevant to the neighboring datasets and thus does not fit here..}
\kat{Say what is a mechanism and how it is connected to the query, what are their differences? In the next section that you speak about the examples, we are still not sure about what is a mechanism in general.}
\begin{definition}
[Differential privacy]
\label{def:dp}
@ -245,9 +255,9 @@ The distribution of all $\pmb{o}$, in some range $\mathcal{O}$, is not affected
% $\pmb{o}$
% \kat{there is no o in the definition above}
% as output
from all possible $O \subseteq \mathcal{O}$, when given $D$ as input.
from $O \subseteq \mathcal{O}$, when given $D$ as input.
The \emph{privacy budget} $\varepsilon$ is a positive real number that represents the user-defined privacy goal~\cite{mcsherry2009privacy}.
As the definition implies, $\mathcal{M}$ achieves stronger privacy protection for lower values of $\varepsilon$ since the probabilities of $D$ and $D'$ being true worlds are similar, but the utility of tje output
As the definition implies, $\mathcal{M}$ achieves stronger privacy protection for lower values of $\varepsilon$ since the probabilities of $D$ and $D'$ being true worlds are similar, but the utility of the output
% $\pmb{o}$
% \kat{there is no o in the definition above}
is reduced since more randomness is introduced by $\mathcal{M}$.
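As a rough numerical illustration, and assuming the standard multiplicative form of the bound in Definition~\ref{def:dp}, the factor $e^{\varepsilon}$ limits how much the probability of any set of outputs $O$ may differ between two neighboring data sets $D$ and $D'$:
\[
  \Pr[\mathcal{M}(D) \in O] \leq e^{\varepsilon} \Pr[\mathcal{M}(D') \in O], \qquad e^{0.1} \approx 1.105, \quad e^{1} \approx 2.718.
\]
Hence, for $\varepsilon = 0.1$ the two probabilities differ by at most about $10.5\%$, whereas for $\varepsilon = 1$ one may be almost three times the other.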
@ -264,10 +274,11 @@ of differential privacy mechanisms is inseparable from the query's
function sensitivity.
The presence/absence of a single record should only change the result slightly,
% \kat{do you want to say 'should' and not 'can'?}
and therefore differential privacy methods are best for low sensitivity queries such as counts.
and therefore differential privacy methods are best for low sensitivity queries (see Definition~\ref{def:qry-sens}) such as counts.
However, sum, max, and in some cases average
% \kat{and average }
queries can be problematic since a single (but outlier) value could change the output noticeably, making it necessary to add a lot of noise to the query's answer.
queries can be problematic, since a single, outlier value could change the output noticeably, making it necessary to add a lot of noise to the query's answer.
% \kat{introduce and link to the previous text the following definition }
@ -279,7 +290,8 @@ queries can be problematic since a single (but outlier) value could change the o
\end{definition}
\paragraph{Privacy mechanisms}
\paragraph{Popular privacy mechanisms}
\label{subsec:prv-mech}
A typical example of a differential privacy mechanism is the \emph{Laplace mechanism}~\cite{dwork2014algorithmic}.
It randomly draws a value from the probability distribution of $\textrm{Laplace}(\mu, b)$, where $\mu$ stands for the location parameter and $b > 0$ is the scale parameter (Figure~\ref{fig:laplace}).
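For illustration only, and assuming the usual calibration with $\mu = 0$ and $b = \frac{\Delta f}{\varepsilon}$, where $\Delta f$ denotes the sensitivity of the query $f$ (notation introduced here for this sketch), a count query with $\Delta f = 1$ released under $\varepsilon = 0.5$ would be perturbed as follows:
\[
  \mathcal{M}(D) = f(D) + \eta, \qquad \eta \sim \textrm{Laplace}(0, b), \qquad b = \frac{\Delta f}{\varepsilon} = \frac{1}{0.5} = 2.
\]
The noise has variance $2b^2 = 8$, i.e.,~a standard deviation of approximately $2.83$; halving $\varepsilon$ doubles $b$ and thus the expected magnitude of the noise.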
@ -294,9 +306,9 @@ A specialization of this mechanism for location data is the \emph{Planar Laplace
\label{fig:laplace}
\end{figure}
For query functions that do not return a real number, e.g.,~`What is the most visited country this year?' or in cases where perturbing the value of the output will completely destroy its utility, e.g.,~`What is the optimal price for this auction?', most works in the literature use the \emph{Exponential mechanism}~\cite{mcsherry2007mechanism}.
This mechanism utilizes a utility function $u$ that maps (input data set $D$, output value $r$) pairs to utility scores, and selects an output value $r$ from the input pairs, with probability proportional to $\exp(\frac{\varepsilon u(D, r)}{2\Delta u})$,
where $\Delta u$ is the sensitivity of the utility \kat{what is the utility function?} function.
For query functions that do not return a real number, e.g.,~`What is the most visited country this year?' or in cases where perturbing the value of the output will completely destroy its utility, e.g.,~`What is the optimal price for this auction?' most works in the literature use the \emph{Exponential mechanism}~\cite{mcsherry2007mechanism}.
This mechanism utilizes a utility function $u$ that maps (input data set $D$, output value $r$) pairs to utility scores, and selects an output value $r$ from the input pairs with probability proportional to $\exp(\frac{\varepsilon u(D, r)}{2\Delta u})$.
$\Delta u$ is the sensitivity of the utility \kat{what is the utility function?} function.
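As a sketch of the selection rule, normalizing the above weights over a finite set $\mathcal{R}$ of candidate outputs (a symbol introduced here only for illustration) gives
\[
  \Pr[\mathcal{M}(D) = r] = \frac{\exp\left(\frac{\varepsilon u(D, r)}{2 \Delta u}\right)}{\sum_{r' \in \mathcal{R}} \exp\left(\frac{\varepsilon u(D, r')}{2 \Delta u}\right)}.
\]
For instance, with hypothetical utility scores $3$, $1$, and $0$ for three candidate countries, $\Delta u = 1$, and $\varepsilon = 2$, the weights are $e^{3}$, $e^{1}$, and $e^{0}$, which yield selection probabilities of approximately $0.84$, $0.11$, and $0.04$, respectively.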
Another technique for differential privacy mechanisms is the \emph{randomized response}~\cite{warner1965randomized}.
It is a privacy-preserving survey method that introduces probabilistic noise to the statistics of a study by randomly instructing respondents to answer either truthfully or `Yes' to a sensitive, binary question.
@ -304,14 +316,17 @@ The technique achieves this randomization by including a random event, e.g.,~the
The respondents reveal to the interviewers only their answer to the question, and keep as a secret the result of the random event (i.e.,~if the coin was tails or heads).
Thereafter, the interviewers can calculate the probability distribution of the random event, e.g.,~$\frac{1}{2}$ heads and $\frac{1}{2}$ tails, and thus they can roughly eliminate the false responses and estimate the final result of the study.
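A minimal worked estimate, assuming a fair coin where one side instructs a truthful answer and the other a forced `Yes' (the symbols $\pi$ and $\lambda$ are introduced here only for illustration): if $\pi$ is the true proportion of `Yes' and $\lambda$ is the observed proportion of `Yes' responses, then
\[
  \lambda = \frac{1}{2} \pi + \frac{1}{2} \quad \Longrightarrow \quad \hat{\pi} = 2 \lambda - 1.
\]
For example, if $600$ out of $1000$ respondents answer `Yes', then $\lambda = 0.6$ and the estimated true proportion is $\hat{\pi} = 0.2$.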
A special category of differential privacy-preserving algorithms is that of \emph{pan-private} algorithms~\cite{dwork2010pan}.
\kat{is the following two paragraphs still part of the examples of privacy mechanisms? I am little confused here.. if the section is not only for examples, then you should introduce it somehow (and not start directly by saying 'A typical example...')}
A special category of differential privacy-preserving algorithms \kat{algorithms? why not mechanisms ?} is that of \emph{pan-private} algorithms~\cite{dwork2010pan}.
Pan-private algorithms hold their privacy guarantees even when snapshots of their internal state (memory) are accessed during their execution by an external entity, e.g.,~through a subpoena or a security breach.
There are two intrusion types that a data publisher has to deal with when designing a pan-private mechanism: \emph{single unannounced}, and \emph{continual announced} intrusion.
In the former, the data publisher assumes that the mechanism's state is observed by the external entity only once, without the data publisher ever being notified about it.
In the latter, the external entity gains access to the mechanism's state multiple times, and the publisher is notified after each time.
The simplest approach to deal with both cases is to make sure that the data in the memory of the mechanism constantly have the same distribution, i.e.,~they are differentially private.
Notice that this must hold throughout the mechanism's lifetime, even before/\allowbreak after it processes any sensitive data item(s).
Notice that this must hold throughout the mechanism's lifetime, even before/\allowbreak after it processes any sensitive data item(s). \kat{what do you mean here? even if it processes non-sensitive items before or after?}
\kat{The way you start this paragraph is more suited for the related work. If you want to present Pufferfish as a background knowledge, do it directly. But in my opinion, since you do not use it for your work, there is no meaning for putting this in your background section. Mentioning it in the related work is sufficient. Same for geo-indistinguishability. }
The notion of differential privacy has highly influenced the research community, resulting in many follow-up publications (\cite{mcsherry2007mechanism, kifer2011no, zhang2017privbayes} to mention a few).
We distinguish here \emph{Pufferfish}~\cite{kifer2014pufferfish} and \emph{geo-indistinguishability}~\cite{andres2013geo,chatzikokolakis2015geo}.
\emph{Pufferfish} is a framework that allows experts in an application domain, without necessarily having any particular expertise in privacy, to develop privacy definitions for their data sharing needs.
@ -327,6 +342,10 @@ This similarity depends on $r$ because the closer two locations are, the more li
Intuitively, the definition implies that if an adversary learns the published location for an individual, the adversary cannot infer the individual's true location, out of all the points in a radius $r$, with a certainty higher than a factor depending on $l$.
The technique adds random noise drawn from a multivariate Laplace distribution to individuals' locations, while taking into account spatial boundaries and features.
\bigskip
In what follows, we present some fundamental properties of differentially private mechanisms that govern their composition and post-processing.
\paragraph{Composition}
\label{subsec:compo}
@ -373,7 +392,7 @@ When the users consider recent data releases more privacy sensitive than distant
\end{theorem}
When dealing with temporally correlated data, we handle a sequence of $w \leq t \in \mathbb{Z}^+$ mechanisms (indexed by $m \in [1, t]$) as a single entity where each mechanism contributes to the temporal privacy loss depending on its order of application~\cite{cao2017quantifying}.
The first ($m - 1$ if $w \leq 2$ or $m - w + 1$ if $w > 2$) and last ($m$) mechanisms contribute to the backward and forward temporal privacy loss respectively.
The first ($m - 1$ if $w \leq 2$ or $m - w + 1$ if $w > 2$) and last ($m$) mechanisms contribute to the backward and forward temporal privacy loss respectively (see also Section~\ref{subsec:cor-temp}).
When $w$ is greater than $2$, the rest of the mechanisms (between $m - w + 2$ and $m - 1$) contribute only to the privacy loss that corresponds to the publication of the relevant data.
\begin{theorem}
@ -407,7 +426,7 @@ However, the \emph{post-processing} of a perturbed data set can be done without
The post-processing of any output of an $\varepsilon$-differential privacy mechanism shall not deteriorate its privacy guarantee.
\end{theorem}
Naturally, using the same (or different) privacy mechanism(s) multiple times to interact with raw data in combination with already perturbed data implies that the privacy guarantee of the final output will be calculated according to Theorem~\ref{theor:compo-seq-ind}.
Naturally, using the same (or different) privacy mechanism(s) multiple times to interact with raw data in combination with already perturbed data implies that the privacy guarantee of the final output will be calculated according to Theorem~\ref{theor:compo-seq-ind}. \kat{can you be more explicit here? Do you mean that only the consumption of budget on the raw data will be taken into account? And that the queries over the results do not count?}
\begin{example}