\section{Data dependence and correlation}
\label{sec:correlation}
\subsection{Types of correlation}
The most prominent types of correlation are the following:
\begin{itemize}
\item \emph{Temporal}~\cite{wei2006time}---appearing in observations (i.e.,~values) of the same object over time.
\item \emph{Spatial}~\cite{legendre1993spatial, anselin1995local}---reflected in the degree of similarity of nearby data points in space, and indicating whether and how phenomena relate to the (broader) area where they take place.
\item \emph{Spatiotemporal}---a combination of the previous two categories, appearing when processing time series or sequences of human activities with geolocation characteristics, e.g.,~\cite{ghinita2009preventing}.
\end{itemize}
Contrary to one-dimensional correlations, spatial correlation is multi-dimensional and multi-directional, and can be measured by indicators (e.g.,~\emph{Moran's I}~\cite{moran1950notes}) that reflect the \emph{spatial association} of the concerned data.
Spatial autocorrelation has its foundations in the \emph{First Law of Geography} stating that ``everything is related to everything else, but near things are more related than distant things''~\cite{tobler1970computer}.
A positive spatial autocorrelation indicates that similar data are \emph{clustered}, a negative one that data are dispersed and close to dissimilar values, and a value close to zero that data are \emph{randomly arranged} in space.
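To make the notion of a spatial association indicator concrete, the following minimal sketch (in Python, not part of any cited work) computes Moran's I for values on a small regular grid; the grid size, the binary rook-adjacency weights, and the toy value patterns are illustrative assumptions.
\begin{verbatim}
# Minimal sketch: Moran's I for values on a small regular grid with
# rook-adjacency (shared-edge) binary weights. Grid size, weights, and the
# toy value patterns are illustrative assumptions.
import numpy as np

def morans_i(values, weights):
    """I = (n / W) * sum_ij w_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2."""
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    dev = x - x.mean()
    return (x.size / w.sum()) * (dev @ w @ dev) / np.sum(dev ** 2)

def rook_weights(rows, cols):
    """Binary spatial weights: 1 for cells sharing an edge, 0 otherwise."""
    w = np.zeros((rows * cols, rows * cols))
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    w[r * cols + c, rr * cols + cc] = 1.0
    return w

w = rook_weights(4, 4)
clustered = np.repeat([0.0, 0.0, 1.0, 1.0], 4)        # two homogeneous bands
checker = np.indices((4, 4)).sum(axis=0).ravel() % 2  # alternating pattern
print(morans_i(clustered, w))  # positive: similar values are clustered
print(morans_i(checker, w))    # negative: neighbours are dissimilar
\end{verbatim}
The clustered pattern yields a positive indicator and the checkerboard a negative one, matching the interpretation above.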
\subsection{Extraction of correlation}
A common practice for extracting data dependencies from continuous data is to express the data as a \emph{stochastic} or \emph{random process}.
A random process is a collection of \emph{random variables} or \emph{bivariate data}, indexed by some set, e.g.,~a series of timestamps, a Cartesian plane $\mathbb{R}^2$, an $n$-dimensional Euclidean space, etc.~\cite{skorokhod2005basic}.
The values a random variable can take are outcomes of an unpredictable process, while bivariate data are pairs of data values with a possible association between them.
Expressing data as a stochastic process allows modeling them according to their properties and, thereafter, discovering the relevant data dependencies.
Some common techniques for modeling stochastic processes include:
\begin{itemize}
\item \emph{Conditional probabilities}~\cite{allan2013probability}---probabilities of events in the presence of other events.
\item \emph{Conditional Random Fields} (CRFs)~\cite{lafferty2001conditional}---undirected graphs encoding conditional probability distributions.
\item \emph{Markov processes}~\cite{rogers2000diffusions}---stochastic processes for which the conditional probability of the future state depends only on the present state and is independent of the previous states (the \emph{Markov assumption}).
\begin{itemize}
\item \emph{Markov chains}~\cite{gagniuc2017markov}---sequences of possible events whose probability depends only on the state attained in the previous event; a minimal estimation sketch follows this list.
\item \emph{Hidden Markov Models} (HMMs)~\cite{baum1966statistical}---statistical Markov models of Markov processes with unobserved (hidden) states.
\end{itemize}
\end{itemize}
\end{itemize}
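As an illustration of the last category, the following minimal sketch (in Python, not drawn from any cited work) estimates the transition matrix of a first-order Markov chain from a discrete sequence by counting one-step transitions; the toy location trace and the state labels are assumptions made for the example.
\begin{verbatim}
# Minimal sketch: maximum-likelihood estimate of a first-order Markov chain
# transition matrix from a discrete sequence, by counting one-step transitions.
# The toy location trace and state labels are illustrative assumptions.
import numpy as np

def estimate_transition_matrix(sequence, states):
    """P[i, j] ~ Pr[next = states[j] | current = states[i]]."""
    index = {s: k for k, s in enumerate(states)}
    counts = np.zeros((len(states), len(states)))
    for current, nxt in zip(sequence, sequence[1:]):
        counts[index[current], index[nxt]] += 1
    totals = counts.sum(axis=1, keepdims=True)
    # Rows with no observed transitions fall back to a uniform distribution.
    with np.errstate(invalid="ignore"):
        return np.where(totals > 0, counts / totals, 1.0 / len(states))

# Toy trace of visited places: 'h' (home), 'w' (work), 'g' (gym).
trace = list("hhwwwhgwwhhhwwg")
print(estimate_transition_matrix(trace, states=["h", "w", "g"]))
\end{verbatim}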
\subsection{Privacy risks of correlation}
Data dependence might appear:
\begin{itemize}
\item within a single data set, and
\item between one data set and previous data releases, and/or other external sources~\cite{kifer2011no, chen2014correlated, liu2016dependence, zhao2017dependent}.
\end{itemize}
In the former case, data tuples and data values within a data set may be correlated, or linked in such a way that information about one individual can be inferred even if that individual is absent from the database.
Consequently, assumptions made about the data generation model based on randomness, such as the random world model, the independent and identically distributed (i.i.d.) data model, or the independent-tuples model, may be unrealistic for many real-world scenarios.
Attacks exploiting this kind of dependence are also known as \emph{de Finetti attacks}~\cite{kifer2009attacks}.
In the latter case, the strength of the dependence between a pair of variables can be quantified using \emph{correlations}~\cite{stigler1989francis}.
Correlation implies dependence but not vice versa; nevertheless, the two terms are often used interchangeably.
The correlation among nearby observations, i.e.,~the elements in a series of data points, is referred to as \emph{autocorrelation} or \emph{serial correlation}~\cite{park2018fundamentals}.
Depending on the evaluation technique, e.g.,~\emph{Pearson's correlation coefficient}~\cite{stigler1989francis}, a correlation can be characterized as \emph{negative}, \emph{zero}, or \emph{positive}.
A negative value shows that the behavior of one variable is the \emph{opposite} of that of the other, e.g.,~when one increases the other decreases.
A value of zero means that the variables are not (linearly) related, although this does not necessarily imply that they are \emph{independent}.
A positive correlation indicates that the variables behave in a \emph{similar} manner, e.g.,~when one decreases the other decreases as well.
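A minimal sketch (in Python, with arbitrary synthetic data) illustrating the three cases via Pearson's correlation coefficient:
\begin{verbatim}
# Minimal sketch: the sign of Pearson's correlation coefficient for three
# synthetic pairs of variables (positive, negative, roughly uncorrelated).
# The data are arbitrary and only serve to illustrate the three cases.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
noise = rng.normal(scale=0.1, size=1000)

pairs = {
    "positive":  x + noise,              # moves with x
    "negative": -x + noise,              # moves opposite to x
    "near zero": rng.normal(size=1000),  # unrelated to x
}
for label, y in pairs.items():
    print(label, np.corrcoef(x, y)[0, 1])
\end{verbatim}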
\subsection{Privacy loss under temporal correlation}
% The presence of temporal correlations might result into additional privacy loss consisting of \emph{backward privacy loss} $\alpha^B$ and \emph{forward privacy loss} $\alpha^F$~\cite{cao2017quantifying}.
Cao et al.~\cite{cao2017quantifying} propose a method for computing the temporal privacy loss (TPL) of a differential privacy mechanism in the presence of temporal correlations and background knowledge.
The goal of their technique is to guarantee privacy protection by bounding, at every timestamp, the privacy loss of a mechanism that was designed under the assumption of independent data releases.
They calculate the temporal privacy loss as the sum of the backward and forward privacy loss minus the default privacy loss $\varepsilon$ of the mechanism (because the latter is counted once in each of the two former terms).
This calculation is done for each individual included in the original data set, and the overall temporal privacy loss at every timestamp is equal to the maximum value calculated over all individuals.
The backward/forward privacy loss at any timestamp depends on the backward/forward privacy loss at the previous/next timestamp, the backward/forward temporal correlations, and $\varepsilon$.
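As a purely illustrative example with arbitrary values, if at some timestamp a mechanism with $\varepsilon = 0.5$ has accumulated a backward privacy loss $\alpha^B_t = 1.2$ and a forward privacy loss $\alpha^F_t = 0.9$, its temporal privacy loss at that timestamp is $\alpha_t = 1.2 + 0.9 - 0.5 = 1.6$.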
\begin{definition}
[Temporal privacy loss (TPL)]
\label{def:tpl}
The potential privacy loss of a privacy mechanism at a timestamp $t \leq T$ due to a series of outputs $\pmb{o}_1$, \dots, $\pmb{o}_T$ and temporal correlations in its input $D_t$ with respect to any adversary, targeting an individual with potential data items $x_t$ (or $x'_t$) and having knowledge $\mathbb{D}_t$ equal to $D_t - \{x_t\}$ (or $D'_t - \{x'_t\}$), is defined as:
\begin{equation}
\label{eq:tpl}
\alpha_t = \sup_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_T} \ln \frac{\Pr[\pmb{o}_1, \dots, \pmb{o}_T | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_1, \dots, \pmb{o}_T | x'_t, \mathbb{D}_t]}
\end{equation}
\end{definition}
%
By analyzing Equation~\ref{eq:tpl} we get the following:
\begin{align}
\label{eq:tpl-1}
(\ref{eq:tpl}) = & \underbrace{\sup_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_1, \dots, \pmb{o}_t| x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_1, \dots, \pmb{o}_t | x'_t, \mathbb{D}_t]}}_{\text{Backward privacy loss}\ (\alpha^B_t)} \nonumber \\
& + \underbrace{\sup_{x_t, x'_t, \pmb{o}_t, \dots, \pmb{o}_T} \ln \frac{\Pr[\pmb{o}_t, \dots, \pmb{o}_T| x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t, \dots, \pmb{o}_T | x'_t, \mathbb{D}_t]}}_{\text{Forward privacy loss}\ (\alpha^F_t)} \nonumber \\
& - \underbrace{\sup_{x_t, x'_t, \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t] }{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}}_{\text{Present privacy loss}\ (\varepsilon_t)}
\end{align}
\begin{definition}
[Backward privacy loss (BPL)]
\label{def:bpl}
The potential privacy loss of a privacy mechanism at a timestamp $t \leq T$ due to outputs $\pmb{o}_1$, \dots, $\pmb{o}_t$ and temporal correlations in its input $D_t$ with respect to any adversary, targeting an individual with potential data items $x_t$ (or $x'_t$) and having knowledge $\mathbb{D}_t$ equal to $D_t - \{x_t\}$ (or $D'_t - \{x'_t\}$), is called backward privacy loss and is defined as:
\begin{equation}
\label{eq:bpl-1}
\alpha^B_t = \sup_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_1, \dots, \pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_1, \dots, \pmb{o}_t | x'_t, \mathbb{D}_t]}
\end{equation}
\end{definition}
%
Differential privacy assumes that the outputs $\pmb{o}_1$, \dots, $\pmb{o}_t$ are independent events.
Therefore, by Bayes' theorem, we can write Equation~\ref{eq:bpl-1} as:
\begin{align}
\label{eq:bpl-2}
(\ref{eq:bpl-1}) = & \sup_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1}| x_t, \mathbb{D}_t] \Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x'_t, \mathbb{D}_t] \Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]} \nonumber \\
= & \sup_{x_t, x_t', \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \frac{\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_t', \mathbb{D}_t]} \nonumber \\
& + \sup_{x_t, x_t', \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}
\end{align}
%
Applying the law of total probability to the first term of Equation~\ref{eq:bpl-2} for all the possible data $x_{t - 1}$ (or $x'_{t - 1}$) and $\mathbb{D}_{t - 1}$, we get the following:
\begin{align}
\label{eq:bpl-3}
(\ref{eq:bpl-2}) = &
\adjustbox{max width=0.9\linewidth}{
$\sup\limits_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \frac{\sum\limits_{x_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_t, \mathbb{D}_t, x_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x_{t - 1}, \mathbb{D}_{t - 1} | x_t, \mathbb{D}_t]}{\sum\limits_{x'_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x'_t, \mathbb{D}_t, x'_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x'_{t - 1}, \mathbb{D}_{t - 1} | x'_t, \mathbb{D}_t]}$
} \nonumber \\
& \adjustbox{max width=0.3\linewidth}{
$+ \sup\limits_{x_t, x_t', \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}$
}
\end{align}
%
Since $\mathbb{D}_t$ is equal to $D_t - \{x_t\}$ (or $D'_t - \{x'_t\}$), and thus is constant and independent of every possible $x_t$ (or $x'_t$), $\forall t \leq T$, Equation~\ref{eq:bpl-3} can be written as:
\begin{align}
\label{eq:bpl-4}
(\ref{eq:bpl-3}) = &
\adjustbox{max width=0.9\linewidth}{
$\sup\limits_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \frac{\sum\limits_{x_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_t, \mathbb{D}_t, x_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x_{t - 1} | x_t, \mathbb{D}_t] \Pr[\mathbb{D}_{t - 1} | x_t, \mathbb{D}_t]}{\sum\limits_{x'_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x'_t, \mathbb{D}_t, x'_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x'_{t - 1} | x'_t, \mathbb{D}_t] \Pr[\mathbb{D}_{t - 1} | x'_t, \mathbb{D}_t]}$
} \nonumber \\
& \adjustbox{max width=0.275\linewidth}{
$+ \sup\limits_{x_t, x_t', \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}$
} \nonumber \\
= & \adjustbox{max width=0.825\linewidth}{
$\sup\limits_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \frac{\sum\limits_{x_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_t, \mathbb{D}_t, x_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x_{t - 1} | x_t] \Pr[\mathbb{D}_{t - 1} | \mathbb{D}_t]}{\sum\limits_{x'_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x'_t, \mathbb{D}_t, x'_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x'_{t - 1} | x'_t] \Pr[\mathbb{D}_{t - 1} | \mathbb{D}_t]}$
} \nonumber \\
& \adjustbox{max width=0.275\linewidth}{
$+ \sup\limits_{x_t, x_t', \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}$
} \nonumber \\
= & \adjustbox{max width=0.7\linewidth}{
$\sup\limits_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \frac{\sum\limits_{x_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_t, \mathbb{D}_t, x_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x_{t - 1} | x_t]}{\sum\limits_{x'_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x'_t, \mathbb{D}_t, x'_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x'_{t - 1} | x'_t]}$
} \nonumber \\
& \adjustbox{max width=0.275\linewidth}{
$+ \sup\limits_{x_t, x_t', \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}$
}
\end{align}
%
The outputs $\pmb{o}_1$, \dots, $\pmb{o}_{t - 1}$ and $x_t$ (or $x'_t$) are conditionally independent given
$x_{t - 1}$ (or $x'_{t - 1}$), and thus Equation~\ref{eq:bpl-4} can be written as:
\begin{align}
\label{eq:bpl-5}
(\ref{eq:bpl-4}) = &
\adjustbox{max width=0.9\linewidth}{
$\sup\limits_{x_t, x'_t, \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \cfrac{\sum\limits_{x_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_{t - 1}, \mathbb{D}_{t - 1}] \Pr[x_{t - 1} | x_t]}{\sum\limits_{x'_{t - 1}} \underbrace{\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x'_{t - 1}, \mathbb{D}_{t - 1}]}_{\alpha^B_{t - 1}} \underbrace{\Pr[x'_{t - 1} | x'_t]}_{P^B_{t - 1}}}$
} \nonumber \\
& \adjustbox{max width=0.4\linewidth}{
$+ \underbrace{\sup\limits_{x_t, x'_t, \pmb{o}_t} \ln \cfrac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}}_{\varepsilon_t}$
}
\end{align}
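Equation~\ref{eq:bpl-5} expresses the backward privacy loss at timestamp $t$ in terms of the previous loss $\alpha^B_{t - 1}$ (which bounds the ratio between any two of the unknown probabilities $\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | x_{t - 1}, \mathbb{D}_{t - 1}]$), the backward transition probabilities $\Pr[x_{t - 1} | x_t]$, and $\varepsilon_t$.
The following minimal sketch (in Python) is one way to evaluate this recursion for a small, discrete state space: it treats the unknown probabilities as free variables whose pairwise ratios are bounded by $e^{\alpha^B_{t - 1}}$ and maximizes the resulting linear-fractional objective by enumerating the vertices of the corresponding box.
Cao et al.\ cast the same kind of problem as a linear-fractional program; the brute-force enumeration, the two-state transition matrix, and the chosen $\varepsilon$ below are simplifying assumptions for illustration only.
\begin{verbatim}
# Sketch of one way to evaluate the backward-privacy-loss recursion for a
# small discrete state space. The unknown factors Pr[o_1..o_{t-1} | x_{t-1}]
# are treated as free variables z_j with pairwise ratios bounded by
# exp(bpl_prev); the maximum of the linear-fractional objective over that box
# is attained at a vertex, so tiny state spaces can be solved by enumeration.
# The transition matrix and epsilon below are illustrative assumptions.
from itertools import product
import numpy as np

def backward_privacy_loss(bpl_prev, transition, epsilon):
    """One step of the recursion.

    transition[i, j] ~ Pr[x_{t-1} = j | x_t = i]   (backward correlations)
    bpl_prev         ~ alpha^B_{t-1},  epsilon ~ loss of the mechanism at t
    """
    hi = np.exp(bpl_prev)
    best = 0.0
    for vertex in product((1.0, hi), repeat=transition.shape[1]):
        weighted = transition @ np.array(vertex)   # one sum per value of x_t
        best = max(best, np.log(weighted.max() / weighted.min()))
    return best + epsilon

p_backward = np.array([[0.8, 0.2],   # Pr[x_{t-1} | x_t] for a 2-state model
                       [0.3, 0.7]])
bpl = 0.0                            # no outputs before the first timestamp
for t in range(1, 6):
    bpl = backward_privacy_loss(bpl, p_backward, epsilon=0.5)
    print(t, round(bpl, 3))
\end{verbatim}
The forward privacy loss, defined next, has a symmetric structure and could, under the same assumptions, be evaluated with the same routine using the forward transition probabilities $\Pr[x_{t + 1} | x_t]$.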
\begin{definition}
[Forward privacy loss (FPL)]
\label{def:fpl}
The potential privacy loss of a privacy mechanism at a timestamp $t \leq T$ due to outputs $\pmb{o}_t$,\dots,$\pmb{o}_T$ and temporal correlations in its input $D_t$ with respect to any adversary, targeting an individual with potential data item $x_t$ (or $x'_t$) and having knowledge $\mathbb{D}_t$ equal to $D_t - \{x_t\}$ (or $D'_t - \{x'_t\}$), is called forward privacy loss and is defined as:
\begin{equation}
\label{eq:fpl-1}
\alpha^F_t = \sup_{x_t, x'_t, \pmb{o}_t, \dots, \pmb{o}_T} \ln \frac{\Pr[\pmb{o}_t, \dots, \pmb{o}_T | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t, \dots, \pmb{o}_T | x'_t, \mathbb{D}_t]}
\end{equation}
\end{definition}
%
Following the same steps that led from Equation~\ref{eq:bpl-1} to Equation~\ref{eq:bpl-5}, we can write Equation~\ref{eq:fpl-1} as follows:
\begin{align}
\label{eq:fpl-2}
(\ref{eq:fpl-1}) = &
\adjustbox{max width=0.9\linewidth}{
$\sup\limits_{x_t, x'_t, \pmb{o}_{t + 1}, \dots, \pmb{o}_T} \ln \cfrac{\sum\limits_{x_{t + 1}} \Pr[\pmb{o}_{t + 1}, \dots, \pmb{o}_T | x_{t + 1}, \mathbb{D}_{t + 1}] \Pr[x_{t + 1} | x_t]}{\sum\limits_{x'_{t + 1}} \underbrace{\Pr[\pmb{o}_{t + 1}, \dots, \pmb{o}_T | x'_{t + 1}, \mathbb{D}_{t + 1}]}_{\alpha^F_{t + 1}} \underbrace{\Pr[x'_{t + 1} | x'_t]}_{P^F_{t + 1}}}$
} \nonumber \\
& \adjustbox{max width=0.4\linewidth}{
$+ \underbrace{\sup\limits_{x_t, x'_t, \pmb{o}_t} \ln \cfrac{\Pr[\pmb{o}_t | x_t, \mathbb{D}_t]}{\Pr[\pmb{o}_t | x'_t, \mathbb{D}_t]}}_{\varepsilon_t}$
}
\end{align}
Equations~\ref{eq:tpl-1},~\ref{eq:bpl-5}, and~\ref{eq:fpl-2} apply to the global publishing schema.
In the local schema, $D$ (or $D'$) is a single data item and is the same as $x$ (or $x'$), i.e.,~the possible data item of an individual user.
Therefore, we calculate the extra privacy loss under temporal correlations, due to an adversary targeting a user at a timestamp $t$, under the assumption that the user's possible data are $D_t$ or $D'_t$.
More specifically, the calculation of TPL (Equation~\ref{eq:tpl-1}) becomes:
\begin{align}
\label{eq:tpl-local}
& \underbrace{\sup_{D_t, D'_t, \pmb{o}_1, \dots, \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_1, \dots, \pmb{o}_t| D_t]}{\Pr[\pmb{o}_1, \dots, \pmb{o}_t | D'_t]}}_{\text{Backward privacy loss}\ (\alpha^B_t)} \nonumber \\
& + \underbrace{\sup_{D_t, D'_t, \pmb{o}_t, \dots, \pmb{o}_T} \ln \frac{\Pr[\pmb{o}_t, \dots, \pmb{o}_T| D_t]}{\Pr[\pmb{o}_t, \dots, \pmb{o}_T | D'_t]}}_{\text{Forward privacy loss}\ (\alpha^F_t)} \nonumber \\
& - \underbrace{\sup_{D_t, D'_t, \pmb{o}_t} \ln \frac{\Pr[\pmb{o}_t | D_t]}{\Pr[\pmb{o}_t | D'_t]}}_{\text{Present privacy loss}\ (\varepsilon_t)}
\end{align}
%
The calculation of BPL (Equation~\ref{eq:bpl-5}) becomes:
\begin{align}
\label{eq:bpl-local}
& \adjustbox{max width=0.9\linewidth}{
$\sup\limits_{D_t, D'_t, \pmb{o}_1, \dots, \pmb{o}_{t - 1}} \ln \cfrac{\sum\limits_{D_{t - 1}} \Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | D_{t - 1}] \Pr[D_{t - 1} | D_t]}{\sum\limits_{D'_{t - 1}} \underbrace{\Pr[\pmb{o}_1, \dots, \pmb{o}_{t - 1} | D'_{t - 1}]}_{\alpha^B_{t - 1}} \underbrace{\Pr[D'_{t - 1} | D'_t]}_{P^B_{t - 1}}}$
} \nonumber \\
& \adjustbox{max width=0.4\linewidth}{
$+ \underbrace{\sup\limits_{D_t, D'_t, \pmb{o}_t} \ln \cfrac{\Pr[\pmb{o}_t | D_t]}{\Pr[\pmb{o}_t | D'_t]}}_{\varepsilon_t}$
}
\end{align}
%
The calculation of FPL (Equation~\ref{eq:fpl-2}) becomes:
\begin{align}
\label{eq:fpl-local}
& \adjustbox{max width=0.9\linewidth}{
$\sup\limits_{D_t, D'_t, \pmb{o}_{t + 1}, \dots, \pmb{o}_T} \ln \cfrac{\sum\limits_{D_{t + 1}} \Pr[\pmb{o}_{t + 1}, \dots, \pmb{o}_T | D_{t + 1}] \Pr[D_{t + 1} | D_t]}{\sum\limits_{D'_{t + 1}} \underbrace{\Pr[\pmb{o}_{t + 1}, \dots, \pmb{o}_T | D'_{t + 1}]}_{\alpha^F_{t + 1}} \underbrace{\Pr[D'_{t + 1} | D'_t]}_{P^F_{t + 1}}}$
} \nonumber \\
& \adjustbox{max width=0.4\linewidth}{
$+ \underbrace{\sup\limits_{D_t, D'_t, \pmb{o}_t} \ln \cfrac{\Pr[\pmb{o}_t | D_t]}{\Pr[\pmb{o}_t | D'_t]}}_{\varepsilon_t}$
}
\end{align}
The authors propose solutions to bound the temporal privacy loss, under the presence of weak to moderate correlations, in both finite and infinite data publishing scenarios.
In the latter (infinite) case, they try to find a value of $\varepsilon$ for which the backward and forward privacy losses are equal.
In the former (finite) case, they similarly try to balance the backward and forward privacy losses, while allocating more of the $\varepsilon$ budget to the first and last timestamps, since these have a higher impact on the privacy loss of the subsequent and preceding ones.
This way they achieve an overall constant temporal privacy loss throughout the time series.
According to the technique's intuition, stronger correlations result in higher privacy loss.
However, the loss is smaller when the transition matrix, which is extracted from the model of the correlations (Markov chains in this work), has a larger dimension, because larger transition matrices tend to be more uniform and therefore encode weaker data dependence.
The authors briefly investigate all of the possible privacy levels; however, the solutions that they propose apply only at the event level.
Last but not least, the technique requires calculating the temporal privacy loss for every individual in the data set, which might prove computationally inefficient in real-time scenarios.