correlation: Reviewed

Manos Katsomallos 2021-07-30 20:27:21 +03:00
parent 84d33dd7f3
commit d59cb3beb2


@@ -1,10 +1,10 @@
\section{Data correlation}
\label{sec:correlation}
\subsection{Types of correlation}
The most prominent types of correlation are:
\begin{itemize}
\item \emph{temporal}~\cite{wei2006time}---appearing in observations (i.e.,~values) of the same object over time.
@@ -12,17 +12,17 @@ The most prominent types of correlation are:
\item \emph{spatiotemporal}---a combination of the previous categories, appearing when processing time series or sequences of human activities with geolocation characteristics, e.g.,~\cite{ghinita2009preventing}.
\end{itemize}
Contrary to one-dimensional correlation, spatial correlation is multi-dimensional and multi-directional, and can be measured by indicators (e.g.,~\emph{Moran's I}~\cite{moran1950notes}) that reflect the \emph{spatial association} of the concerned data.
Spatial autocorrelation has its foundations in the \emph{First Law of Geography}, stating that ``everything is related to everything else, but near things are more related than distant things''~\cite{tobler1970computer}.
A positive spatial autocorrelation indicates that similar data are \emph{clustered}, a negative one that data are dispersed and close to dissimilar ones, and a value close to zero that data are \emph{randomly arranged} in space.
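To illustrate how such an indicator behaves, the following minimal sketch (plain NumPy; the sample values and binary neighbourhood weight matrix are hypothetical) computes Moran's I for four locations on a chain, where similar values sit next to each other and the statistic therefore comes out positive:

```python
import numpy as np

def morans_i(values, weights):
    """Moran's I spatial autocorrelation statistic.

    values:  1-D array of observations, one per location.
    weights: (n, n) spatial weight matrix; weights[i][j] > 0 when
             locations i and j are neighbours, diagonal is zero.
    """
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    n = x.size
    z = x - x.mean()                      # deviations from the mean
    num = n * np.sum(w * np.outer(z, z))  # n * sum_ij w_ij z_i z_j
    den = w.sum() * np.sum(z ** 2)        # W * sum_i z_i^2
    return num / den

# Clustered values on a chain of 4 locations: low, low, high, high.
vals = [1.0, 1.1, 5.0, 5.2]
w = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(morans_i(vals, w))  # positive: similar data are clustered
```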
\subsection{Extraction of correlation}
A common practice for extracting data dependence from continuous data is to express the data as a \emph{stochastic} or \emph{random process}.
A random process is a collection of \emph{random variables} or \emph{bivariate data}, indexed by some set, e.g.,~a series of timestamps, a Cartesian plane $\mathbb{R}^2$, an $n$-dimensional Euclidean space, etc.~\cite{skorokhod2005basic}.
The values a random variable can take are outcomes of an unpredictable process, while bivariate data are pairs of data values with a possible association between them.
Expressing data as stochastic processes allows modeling them depending on their properties, and thereafter discovering relevant data dependence.
Some common stochastic process modeling techniques include:
@@ -39,7 +39,7 @@ Some common stochastic process modeling techniques include:
\subsection{Privacy risks of correlation}
Correlation appears in dependent data:
\begin{itemize}
\item within one data set, and
@@ -50,7 +50,7 @@ In the former case, data tuples and data values within a data set may be correlated
Consequently, in this category we put assumptions made on the data generation model based on randomness, like the random world model, the independent and identically distributed (i.i.d.) data model, or the independent-tuples model, which may be unrealistic for many real-world scenarios.
This attack is also known as \emph{de Finetti's attack}~\cite{kifer2009attacks}.
In the latter case, the strength of the dependence between a pair of variables can be quantified with the utilization of \emph{correlation}~\cite{stigler1989francis}.
Correlation implies dependence but not vice versa; however, the two terms are often used as synonyms.
The correlation among nearby observations, i.e.,~the elements in a series of data points, is referred to as \emph{autocorrelation} or \emph{serial correlation}~\cite{park2018fundamentals}.
Depending on the evaluation technique, e.g.,~\emph{Pearson's correlation coefficient}~\cite{stigler1989francis}, a correlation can be characterized as \emph{negative}, \emph{zero}, or \emph{positive}.
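Both notions can be sketched in a few lines (function names and sample series are illustrative): Pearson's coefficient for a pair of series, and serial correlation obtained by correlating a series with a lagged copy of itself:

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)

def autocorrelation(series, lag=1):
    """Serial correlation: correlate the series with itself shifted by `lag`."""
    return pearson(series[:-lag], series[lag:])

print(pearson([1, 2, 3], [2, 4, 6]))       # ~1.0: positive correlation
print(pearson([1, 2, 3], [-2, -4, -6]))    # ~-1.0: negative correlation
print(autocorrelation([1, 2, 3, 4, 5, 6])) # ~1.0: strong serial correlation
```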
@@ -61,17 +61,17 @@ A positive correlation indicates that the variables behave in a \emph{similar} manner
\subsection{Privacy loss under temporal correlation}
% The presence of temporal correlation might result in additional privacy loss consisting of \emph{backward privacy loss} $\alpha^B$ and \emph{forward privacy loss} $\alpha^F$~\cite{cao2017quantifying}.
Cao et al.~\cite{cao2017quantifying} propose a method for computing the temporal privacy loss (TPL) of a differential privacy mechanism in the presence of temporal correlation and background knowledge.
The goal of their technique is to guarantee privacy protection and to bound the privacy loss at every timestamp under the assumption of independent data releases.
It calculates the temporal privacy loss as the sum of the backward and forward privacy loss minus the default privacy loss $\varepsilon$ of the mechanism (because it is counted twice in the aforementioned entities).
This calculation is done for each individual that is included in the original data set, and the overall temporal privacy loss is equal to the maximum calculated value at every timestamp.
The backward/forward privacy loss at any timestamp depends on the backward/forward privacy loss at the previous/next timestamp, the backward/forward temporal correlation, and $\varepsilon$.
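The combination step described above can be sketched as follows; the per-individual backward and forward losses are taken here as precomputed inputs (in the actual technique they follow from the temporal correlation model), and the function name and sample values are illustrative:

```python
def temporal_privacy_loss(bpl, fpl, eps):
    """Overall TPL at one timestamp.

    bpl, fpl: dicts mapping each individual to their backward/forward
              privacy loss at this timestamp (assumed precomputed).
    eps:      default privacy loss of the mechanism, counted once in
              bpl and once in fpl, hence subtracted once.
    Returns the maximum per-individual value of BPL + FPL - eps.
    """
    return max(bpl[i] + fpl[i] - eps for i in bpl)

# Hypothetical per-individual losses at a single timestamp.
bpl = {"u1": 0.8, "u2": 0.6, "u3": 0.9}
fpl = {"u1": 0.5, "u2": 0.7, "u3": 0.6}
print(temporal_privacy_loss(bpl, fpl, eps=0.3))  # 1.2, attained by u3
```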
\begin{definition}
[Temporal privacy loss (TPL)]
\label{def:tpl}
The potential privacy loss of a privacy mechanism at a timestamp $t \leq T$ due to a series of outputs $\pmb{o}_1$, \dots, $\pmb{o}_T$ and temporal correlation in its input $D_t$ with respect to any adversary, targeting an individual with potential data items $x_t$ (or $x'_t$) and having knowledge $\mathbb{D}_t$ equal to $D_t - \{x_t\}$ (or $D'_t - \{x'_t\}$), is defined as:
\begin{equation}
\label{eq:tpl}
@@ -91,7 +91,7 @@ By analyzing Equation~\ref{eq:tpl} we get the following:
\begin{definition}
[Backward privacy loss (BPL)]
\label{def:bpl}
The potential privacy loss of a privacy mechanism at a timestamp $t \leq T$ due to outputs $\pmb{o}_1$, \dots, $\pmb{o}_t$ and temporal correlation in its input $D_t$ with respect to any adversary, targeting an individual with potential data items $x_t$ (or $x'_t$) and having knowledge $\mathbb{D}_t$ equal to $D_t - \{x_t\}$ (or $D'_t - \{x'_t\}$), is called backward privacy loss and is defined as:
\begin{equation}
\label{eq:bpl-1}
@@ -165,7 +165,7 @@ $x_{t - 1}$ (or $x'_{t - 1}$), and thus Equation~\ref{eq:bpl-4} can be written as:
\begin{definition}
[Forward privacy loss (FPL)]
\label{def:fpl}
The potential privacy loss of a privacy mechanism at a timestamp $t \leq T$ due to outputs $\pmb{o}_t$, \dots, $\pmb{o}_T$ and temporal correlation in its input $D_t$ with respect to any adversary, targeting an individual with potential data item $x_t$ (or $x'_t$) and having knowledge $\mathbb{D}_t$ equal to $D_t - \{x_t\}$ (or $D'_t - \{x'_t\}$), is called forward privacy loss and is defined as:
\begin{equation}
\label{eq:fpl-1}
@@ -188,7 +188,7 @@ Similar to the way that we concluded to Equation~\ref{eq:bpl-5} from Equation~\ref{eq:bpl-4}
Equations~\ref{eq:tpl-1},~\ref{eq:bpl-5}, and~\ref{eq:fpl-2} apply to the global publishing schema.
In the local schema, $D$ (or $D'$) is a single data item and is the same as $x$ (or $x'$), i.e.,~the possible data item of an individual user.
Therefore, we calculate the extra privacy loss under temporal correlation, due to an adversary that targets a user at a timestamp $t$, based on the assumption that their possible data are $D_t$ or $D'_t$.
More specifically, the calculation of TPL (Equation~\ref{eq:tpl-1}) becomes:
\begin{align}
\label{eq:tpl-local}
@@ -219,12 +219,12 @@ The calculation of FPL (Equation~\ref{eq:fpl-2}) becomes:
}
\end{align}
The authors propose solutions to bound the temporal privacy loss, under the presence of weak to moderate correlation, in both finite and infinite data publishing scenarios.
In the latter case, they try to find a value for $\varepsilon$ for which the backward and forward privacy loss are equal.
In the former, they similarly try to balance the backward and forward privacy loss, while allocating more $\varepsilon$ at the first and last timestamps, since these have a higher impact on the privacy loss of the next and previous ones.
This way, they achieve an overall constant temporal privacy loss throughout the time series.
According to the technique's intuition, stronger correlation results in higher privacy loss.
However, the loss is smaller when the dimension of the transition matrix, which is extracted according to the modeling of the correlation (in this work, Markov chains), is greater, due to the fact that larger transition matrices tend to be uniform, resulting in weaker data dependence.
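The link between uniform transition matrices and weak dependence can be illustrated with a quick check; the matrices below are hypothetical, and the total-variation distance of each row from the uniform distribution is used here as a simple, assumed proxy for correlation strength:

```python
import numpy as np

def distance_from_uniform(P):
    """Mean total-variation distance of each row of a row-stochastic
    transition matrix P from the uniform distribution over its states.
    0 means uniform rows (next state independent of the current one);
    larger values mean stronger dependence."""
    P = np.asarray(P, dtype=float)
    k = P.shape[1]
    return float(np.mean(np.abs(P - 1.0 / k).sum(axis=1) / 2.0))

strong = [[0.9, 0.1], [0.1, 0.9]]      # next state depends heavily on current
weak   = [[0.55, 0.45], [0.45, 0.55]]  # rows nearly uniform
print(distance_from_uniform(strong))   # 0.4
print(distance_from_uniform(weak))     # ~0.05
```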
The authors briefly investigate all of the possible privacy levels; however, the solutions that they propose are applied only at the event-level.
Last but not least, the technique requires the calculation of the temporal privacy loss for every individual within the data set, which might prove computationally inefficient in real-time scenarios.