problem: Review thething

This commit is contained in:
Manos Katsomallos 2021-10-25 01:27:48 +02:00
parent d6a97d81fe
commit aebb0123ba
9 changed files with 142 additions and 148 deletions

Binary file not shown.



@ -1,6 +1,5 @@
\subsection{Contribution} \subsection{Contribution}
\label{subsec:lmdk-contrib} \label{subsec:lmdk-contrib}
In this section, we formally define a novel privacy notion that we call \emph{{\thething} privacy}. In this section, we formally define a novel privacy notion that we call \emph{{\thething} privacy}.
We apply this privacy notion to time series consisting of \emph{{\thethings}} and regular events, and we design and implement three {\thething} privacy schemes. We apply this privacy notion to time series consisting of \emph{{\thethings}} and regular events, and we design and implement three {\thething} privacy schemes.
We further study {\thething} privacy under temporal correlation that is inherent in time series publishing. We investigate {\thething} privacy under temporal correlation, which is inherent in time series publishing, and discuss how {\thethings} can affect the propagation of temporal privacy loss.


@ -1,10 +1,15 @@
\section{Significant events} \section{Significant events}
\label{sec:thething} \label{sec:thething}
The privacy mechanisms for the user, w-event and event levels that are already proposed in the literature, assume that in a time series any single event, or any sequence of events, or the entire series of events is equally privacy-significant for the users. The privacy mechanisms for the user, $w$-event, and event levels that are already proposed in the literature, assume that in a time series any single event, or any sequence of events, or the entire series of events is equally privacy-significant for the users.
In reality, this is a simplistic\kat{I would not say simplistic, but unrealistic assumption that deteriorates unnecessarily the quality of the perturbed data} assumption. In reality, this is
The fact that an event is significant, can be related to certain user-defined privacy criteria, or to its adjacent events, as well as to the entire time series. % a simplistic
We term significant events as \emph{{\thething} events} or simply \emph{\thethings}, following relevant literature\kat{can you find some other work that uses the same term? otherwise one can raise the question why not ot use the word significant }. % \kat{I would not say simplistic, but unrealistic assumption that deteriorates unnecessarily the quality of the perturbed data}
an assumption that deteriorates unnecessarily the utility of the released data.
The significance of an event is related to certain user-defined privacy criteria, or to its adjacent events, as well as to the entire time series.
We term significant events as \emph{{\thething} events} or simply \emph{\thethings}, following relevant literature~\cite{gaskell2000telescoping}.
% \kat{can you find some other work that uses the same term? otherwise one can raise the question why not ot use the word significant }
% \mk{OK, but then again `significant privacy doesn't' sound great}
Identifying {\thethings} in time series can be done in an automatic or manual way. Identifying {\thethings} in time series can be done in an automatic or manual way.
For example, in spatiotemporal data, \emph{places where an individual spent some time} denote \emph{points of interest} (POIs) (also called stay points)~\cite{zheng2015trajectory}. For example, in spatiotemporal data, \emph{places where an individual spent some time} denote \emph{points of interest} (POIs) (also called stay points)~\cite{zheng2015trajectory}.
@ -15,39 +20,39 @@ This can be practical in disease control~\cite{eames2003contact}, similar to the
Last but not least, {\thethings} in \emph{smart grid} electricity usage patterns may not only reveal the energy consumption of a user but also information regarding activities, e.g.,~`at work', `sleeping', etc., or types of appliances already installed or recently purchased~\cite{khurana2010smart}. Last but not least, {\thethings} in \emph{smart grid} electricity usage patterns may not only reveal the energy consumption of a user but also information regarding activities, e.g.,~`at work', `sleeping', etc., or types of appliances already installed or recently purchased~\cite{khurana2010smart}.
We stress that {\thething} identification is an orthogonal problem to ours, and that we consider {\thethings} given as input to our problem. We stress that {\thething} identification is an orthogonal problem to ours, and that we consider {\thethings} given as input to our problem.
We argue that protecting only {\thething} events along with any regular event release -- instead of protecting every event in the timeseries -- is sufficient for the user's protection, while it improves data utility. We argue that protecting only {\thething} events along with any regular event is sufficient for the user's privacy protection, while it improves data utility with respect to the conventional user-level privacy.
More specifically, important events are adequately protected, while less important ones are not excessively perturbed. \kat{something feels wrong with this statement, because in figure 2 regular and landmarks seem to receive the same amount of noise..} Considering {\thethings} can prevent over-perturbing the data to the benefit of their final utility.
%In fact, considering {\thething} events can prevent over-perturbing the data in the benefit of their final quality. Revisiting the scenario in Figure~\ref{fig:st-cont}, if we want to protect the {\thething} points, we have to allocate at most a budget of $\varepsilon$ to the {\thethings}, while saving some for the release of regular events.
Take for example the scenario in Figure~\ref{fig:st-cont}, where {\thethings} are highlighted in gray. Essentially, the more budget we allocate to an event the less we protect it, but at the same time the more we maintain its utility.
If we want to protect the {\thething} points, we have to allocate at most a budget of $\varepsilon$ to the {\thethings}, while saving some for the release of regular events. With {\thething} privacy we propose to distribute the budget by accounting only for the {\thethings} when we release an event of the time series, i.e.,~allocating $\frac{\varepsilon}{5}$ ($4$ {\thethings} $+ 1$ regular point) to each event (see Figure~\ref{fig:st-cont}).
Essentially, the more budget we allocate to an event the less we protect it, but at the same time we maintain its utility. This way, we still guarantee
With {\thething} privacy we propose to distribute the budget taking into account only the existence of the {\thethings} when we release an event of the time series, i.e.,~allocating $\frac{\varepsilon}{5}$ ($4\ \text{\thethings} + 1\ \text{regular point}$) to each event (see Figure~\ref{fig:st-cont}). % \footnote{$\varepsilon$-differential privacy guarantees that the allocated budget should be less or equal to $\varepsilon$, and not precisely how much.
This way, we still guarantee\footnote{$\varepsilon$-differential privacy guarantees that the allocated budget should be less or equal to $\varepsilon$, and not precisely how much.\kat{Mano check.}} that the {\thethings} are adequately protected, as they receive a total budget of $\frac{4\varepsilon}{5}<\varepsilon$. % \kat{Mano check.}
At the same time, we avoid over-perturbing the regular events, as we allocate to them a higher total budget ($\frac{4\varepsilon}{5}$) compared to the user-level scenario ($\frac{\varepsilon}{2}$), and thus less noise. % \mk{It's not clear what you want to say}
% }
that the {\thethings} are adequately protected, as they receive a total budget of $\frac{4\varepsilon}{5} < \varepsilon$.
At the same time, we avoid over-perturbing the regular events, as we allocate to them a higher total budget ($\frac{4\varepsilon}{5}$) than in user-level ($\frac{\varepsilon}{2}$), and thus less noise.
Hence, at any timestamp we achieve an overall privacy protection bounded by $\varepsilon$ in the event set consisting of the released event and the {\thethings}.
\begin{example} \begin{example}
\label{ex:st-cont} \label{ex:st-cont}
Continuing Example~\ref{ex:scenario}, Bob cares about protecting his {\thethings} ($p_1$, $p_3$, $p_5$, $p_8$) along with every release that he makes; however, he is not equally interested in the other regular events in his trajectory.
Figure~\ref{fig:st-cont} shows the case when we want to protect all of Bob's significant events ($p_1$, $p_3$, $p_5$, $p_8$) in his trajectory shown in Figure~\ref{fig:scenario}. More technically, he cares about allocating a total budget of $\varepsilon$ on any set of timestamps containing the {\thethings} and one regular event.
% That is, we have to allocate privacy budget $\varepsilon$ such that at any timestamp $t$ it holds that $\varepsilon_t + \varepsilon_1 + \varepsilon_3 + \varepsilon_5 + \varepsilon_8 \leq \varepsilon$. Event-level protection is not suitable for this case, since it can only protect one event at a time.
In this scenario, event-level protection is not suitable since it can only protect one event at a time. So, let us assume that we apply user-level privacy\footnote{In this scenario, in order to protect all the {\thethings} from timestamp $1$ to $8$, $w$ must be set to $8$, which makes $w$-event privacy equivalent to user-level.}, by distributing equal portions of $\varepsilon$ to all the events, i.e.,~$\frac{\varepsilon}{8}$ to each one (see Figure~\ref{fig:st-cont}).
Hence, we have to apply user-level privacy protection by distributing equal portions of $\varepsilon$ to all the events, i.e.,~$\frac{\varepsilon}{8}$ to each one (the equivalent of applying $8$-event privacy). Indeed, we have protected the {\thething} points plus one regular event at any release as expected; we have allocated a total of $\frac{5\varepsilon}{8}<\varepsilon$ to these $5$ events.
In this way, we have protected the {\thething} points; we have allocated a total of $\frac{\varepsilon}{2}<\varepsilon$ to the {\thethings}.
\begin{figure}[htp]
\begin{figure}[htp] \centering
\centering \includegraphics[width=.75\linewidth]{problem/st-cont}
\includegraphics[width=\linewidth]{problem/st-cont} \caption{User-level and {\thething} $\varepsilon$-differential privacy protection for the time series of Figure~\ref{fig:lmdk-scenario}.}
\caption{User-level and {\thething} $\varepsilon$-differential privacy protection for the time series of Figure~\ref{fig:scenario}.} \label{fig:st-cont}
\label{fig:st-cont} \end{figure}
\end{figure}
However, perturbing by $\frac{\varepsilon}{8}$ each one of the regular points deteriorates the data utility unnecessarily; any budget lower than or equal to $\frac{4\varepsilon}{8}$ would be sufficient for covering the user privacy requirements.
However, perturbing by $\frac{\varepsilon}{8}$ each regular point deteriorates the data utility unnecessarily. On the other hand, our proposed privacy model, {\thething} privacy, directly considers only the $5$ events of interest ($4$ {\thethings} $+ 1$ current event) in every release, thus changing the scope from all the time series to a significant subset of events.
Notice that the overall privacy budget that we ended up allocating to the user-defined significant events is equal to $\frac{\varepsilon}{2}$ and leaves an equal amount of budget to distribute to any current event. Subsequently, it allocates $\frac{\varepsilon}{5}$ to each one of these events.
In other words, uniformly allocating $\frac{\varepsilon}{5}$ to every event would still achieve Bob's privacy goal, i.e.,~protect every significant event, while achieving better utility overall. Consequently, we still manage to protect all the significant events, while the utility of a perturbed event is higher than in the case of user-level privacy ($\frac{\varepsilon}{5}>\frac{\varepsilon}{8}$).
\end{example} \end{example}
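As a rough numerical illustration of Example~\ref{ex:st-cont}, assume (for the sake of this sketch only) a Laplace-type mechanism with sensitivity $\Delta f$, whose noise scale for a per-event budget $\varepsilon_i$ is $\frac{\Delta f}{\varepsilon_i}$.
The two allocations then compare as follows:
$$\frac{\Delta f}{\varepsilon / 8} = \frac{8 \Delta f}{\varepsilon} \quad \text{(user-level)}, \qquad \frac{\Delta f}{\varepsilon / 5} = \frac{5 \Delta f}{\varepsilon} \quad \text{({\thething} privacy)}$$
Under these assumptions, the per-event noise scale shrinks by a factor of $\frac{5}{8}$, while the budget spent on the $4$ {\thethings} plus any single release remains $5 \cdot \frac{\varepsilon}{5} = \varepsilon$.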
\input{problem/thething/contribution} \input{problem/thething/contribution}


@ -1,5 +1,7 @@
\subsection{Problem definition} \subsection{Problem definition}
\label{subsec:lmdk-prob} \label{subsec:lmdk-prob}
In this section, we introduce a new privacy definition.
\subsubsection{Setting} \subsubsection{Setting}
\label{subsec:lmdk-set} \label{subsec:lmdk-set}
@ -10,18 +12,17 @@ Data are produced as a series of events, which we call time series.
An \emph{event} is defined as a triple of an identifying attribute of an individual and the possibly sensitive data at a timestamp. An \emph{event} is defined as a triple of an identifying attribute of an individual and the possibly sensitive data at a timestamp.
%This workflow is repeated in a continuous manner, producing series of events, which we call time series. %This workflow is repeated in a continuous manner, producing series of events, which we call time series.
%, producing, processing, publishing, and consuming events in a private manner. %, producing, processing, publishing, and consuming events in a private manner.
%\kat{keep only the terms with a small description.}
\begin{enumerate}[(i)] \begin{enumerate}[(i)]
\item \textbf{Data generators} (users) entity $E_g$ interacts with a crowdsensing application and produces continuously privacy-sensitive data items in an arbitrary frequency during the application's usage period $T = (t)_{t \in \mathbb{N}}$. \item \textbf{Data generators} (users) entity $E_g$ interacts with a crowdsensing application and produces continuously privacy-sensitive data items in an arbitrary frequency during the application's usage period $T = (i)_{i \in \mathbb{N}}$.
Thus, at each timestamp $t$, $E_g$ generates a data set $D_t \in \mathcal{D}$ where each of its members contributes a single data item. Thus, at each timestamp $i$, $E_g$ generates a data set $D_i \in \mathcal{D}$ where each of its members contributes a single data item.
\item \textbf{Data publishers} (trusted non-adversarial) entity $E_p$ receives the data sent by $E_g$ in the form of a series of events in $T$. \item \textbf{Data publishers} (trusted non-adversarial) entity $E_p$ receives the data sent by $E_g$ in the form of a series of events in $T$.
Following the \emph{global} processing and publishing scheme, $E_p$ collects at $t$ a data set $D_t$ and privacy-protects it by applying the respective privacy mechanism $\mathcal{M}_t$. Following the \emph{global} processing and publishing scheme, $E_p$ collects at $i$ a data set $D_i$ and privacy-protects it by applying the respective privacy mechanism $\mathcal{M}_i$.
$\mathcal{M}_t$ uses independent randomness such that it satisfies $\varepsilon_t$-differential privacy. $\mathcal{M}_i$ uses independent randomness such that it satisfies $\varepsilon_i$-differential privacy.
\item \textbf{Data consumers} (possibly adversarial) entity $E_c$ receives the result $\mathbf{o}_t$ of the privacy-preserving processing of $D_t$ by $E_p$. \item \textbf{Data consumers} (possibly adversarial) entity $E_c$ receives the result $\mathbf{o}_i$ of the privacy-preserving processing of $D_i$ by $E_p$.
According to Theorem~\ref{theor:compo-seq-ind}, the overall privacy guarantee of the outputs of $\mathcal{M}$ is equal to the sum of all the privacy budgets of the respective privacy mechanisms that compose $\mathcal{M}$, i.e.,~$\sum_{t \in T}\varepsilon_t$. According to Theorem~\ref{theor:compo-seq-ind}, the overall privacy guarantee of the outputs of $\mathcal{M}$ is equal to the sum of all the privacy budgets of the respective privacy mechanisms that compose $\mathcal{M}$, i.e.,~$\sum_{i \in T}\varepsilon_i$.
\end{enumerate} \end{enumerate}
@ -31,13 +32,13 @@ Notice that, in a real life scenario, $E_g$ and $E_c$ might overlap with each ot
\subsubsection{Privacy goal} \subsubsection{Privacy goal}
\label{subsec:lmdk-goal} \label{subsec:lmdk-goal}
We argue that in continuous user-generated data publishing, events are not equally significant in terms of privacy.
We argue that in continuous user-generated data publishing, events are not equally `significant' in terms of privacy. We term a significant event---according to user- or data-related criteria---as a \emph{\thething}~event.
% We term a significant event---according to user- or data-related criteria---as a \emph{\thething}~event. The identification of {\thething} events can be performed manually or automatically, and is an orthogonal problem to ours.
The identification of {\thething} events can be performed manually or automatically~\cite{zhou2004discovering, hariharan2004project}, and is an orthogonal problem to this current work. % and we address it subsequently in Section~\ref{subsec:lmdk-sel-sol}.
In this work, we consider the {\thething} timestamps non-sensitive and provided by the user as input along with the privacy budget $\varepsilon$. First, we consider the {\thething} timestamps, i.e.,~their position in time, non-sensitive and provided by the user as input along with the privacy budget $\varepsilon$.
For example, events $p_1$, $p_3$, $p_5$, $p_8$ in Figure~\ref{fig:scenario} are {\thething} events. For example, events $p_1$, $p_3$, $p_5$, $p_8$ in Figure~\ref{fig:lmdk-scenario} are {\thething} events.
We give the definition of {\thethings} below (Definition~\ref{def:thething-evnt}). In Definition~\ref{def:thething-evnt}, we formally introduce {\thethings} in the context of privacy-preserving data publishing.
% A significant event or item signals its consequence to us, toward us. % A significant event or item signals its consequence to us, toward us.
% https://www.quora.com/What-is-the-difference-between-significant-and-important % https://www.quora.com/What-is-the-difference-between-significant-and-important
@ -47,68 +48,66 @@ We give the definition of {\thethings} below (Definition~\ref{def:thething-evnt}
A {\thething} event is a significant---according to user- or data-related criteria---user-generated data item. A {\thething} event is a significant---according to user- or data-related criteria---user-generated data item.
\end{definition} \end{definition}
Definition~\ref{def:thething-nb} extends the notion of neighboring data sets to the context of {\thethings}. Definition~\ref{def:thething-nb} extends the notion of neighboring data sets (see Section~\ref{subsec:prv-statistical}) to the context of {\thethings}.
\begin{definition} \begin{definition}
[{\Thething} neighboring time series] [{\Thething} neighboring time series]
\label{def:thething-nb} \label{def:thething-nb}
Two time series of equal lengths are \emph{{\thething} neighboring} when they differ by a single {\thething} event. Two time series of the same length, with common starting and ending timestamps, are \emph{{\thething} neighboring} when their elements are pairwise, i.e.,~at the same timestamps, equal or neighboring and their neighboring elements are on common {\thethings} and/or at most on one regular event.
\end{definition} \end{definition}
For example, the time series ($p_1$, \dots, $p_8$) with {\thethings} set the \{$p_1$, $p_3$, $p_5$\} is {\thething} neighboring to the time series of Figure~\ref{fig:scenario}. % For example, the time series ($p_1$, \dots, $p_8$) with {\thethings} set the \{$p_1$, $p_3$, $p_5$\} is {\thething} neighboring to the time series of Figure~\ref{fig:lmdk-scenario}.
Therefore, Corollary~\ref{cor:thething-nb} follows. % Therefore, Corollary~\ref{cor:thething-nb} follows.
\begin{corollary} % \begin{corollary}
\label{cor:thething-nb} % \label{cor:thething-nb}
Two {\thething} neighboring time series are event neighboring as well. % Two {\thething} neighboring time series are event neighboring as well.
\end{corollary} % \end{corollary}
We proceed to propose \emph{{\thething} privacy}, a configurable variation of differential privacy for time series (Definition~\ref{def:thething-prv}). In Definition~\ref{def:thething-prv}, we proceed to propose \emph{{\thething} privacy}, a configurable variation of differential privacy for time series with significant events.
\begin{definition} \begin{definition}
[{\Thething} privacy] [{\Thething} privacy]
\label{def:thething-prv} \label{def:thething-prv}
Let $\mathcal{M}$ be a privacy mechanism with range $\mathcal{O}$ that takes as input a time series. Let $\mathcal{M}$ be a privacy mechanism with range $\mathcal{O}$ and domain $\mathcal{S}_T$ being the set of all time series with length $|T|$, where $T$ is a sequence of timestamps.
We say that $\mathcal{M}$ satisfies {\thething} $\varepsilon$-differential privacy (or, simply, {\thething} privacy) if for all sets of possible outputs $O \subseteq \mathcal{O}$, and for every pair of {\thething}-neighboring time series $S_T$, $S_T'$, $\mathcal{M}$ satisfies {\thething} $\varepsilon$-differential privacy (or, simply, {\thething} privacy) if for all sets $O \subseteq \mathcal{O}$, and for every pair of {\thething}-neighboring time series $S_T$, $S_T'$, it holds that
% and all $T = (t)_{t \in \mathbb{N}}$,
it holds that
$$Pr[\mathcal{M}(S_T) \in O] \leq e^\varepsilon Pr[\mathcal{M}(S_T') \in O]$$ $$Pr[\mathcal{M}(S_T) \in O] \leq e^\varepsilon Pr[\mathcal{M}(S_T') \in O]$$
\end{definition} \end{definition}
User-level privacy can achieve {\thething} privacy, but it over-perturbs the final data by not distinguishing into {\thething} and regular events. User-level privacy can achieve {\thething} privacy, but it over-perturbs the final data by not distinguishing between {\thething} and regular events.
Theorem~\ref{theor:thething-prv} proposes how to achieve the desired privacy for the {\thethings} (i.e.,~a total budget lower than $\varepsilon$), and in the same time provide better quality overall. Theorem~\ref{theor:thething-prv} states how to achieve the desired privacy goal for the {\thethings} and any event, i.e.,~a total budget less than $\varepsilon$, and at the same time provide better utility overall.
\begin{theorem} \begin{theorem}
[{\Thething} privacy] [{\Thething} privacy]
\label{theor:thething-prv} \label{theor:thething-prv}
Let $\mathcal{M}$ be a mechanism with input a time series $S_T$, where $T$ is the set of the involved timestamps, and $L \subseteq T$ be the set of {\thething} timestamps. Let $\mathcal{M}$ be a mechanism with input a time series $S_T$, where $T$ is the set of the involved timestamps, and $L \subseteq T$ be the set of {\thething} timestamps.
$\mathcal{M}$ is decomposed to $\varepsilon$-differential private sub-mechanisms $\mathcal{M}_t$, for every $t \in T$, that apply independent randomness to the data item at $t$. $\mathcal{M}$ is decomposed to $\varepsilon$-differential private sub-mechanisms $\mathcal{M}_t$, for every $t \in T$, which apply independent randomness to the event at $t$.
Then, given a privacy budget $\varepsilon$, $\mathcal{M}$ satisfies {\thething} privacy if for every $t$ it holds that Then, given a privacy budget $\varepsilon$, $\mathcal{M}$ satisfies {\thething} privacy if for any $t$ it holds that
$$ \sum_{i\in L \cup \{t\}} \varepsilon_i \leq \varepsilon$$ $$ \sum_{i\in L \cup \{t\}} \varepsilon_i \leq \varepsilon$$
\end{theorem} \end{theorem}
\begin{proof} \begin{proof}
\label{pf:thething-prv} \label{pf:thething-prv}
All mechanisms use independent randomness, and therefore for a time series $S_T = {D_1, \dots, D_T}$ and outputs $(\pmb{o}_1, \dots, \pmb{o}_T) \in O \subseteq \mathcal{O}$ it holds that All mechanisms use independent randomness, and therefore for a time series $S_T = (D_i)_{i \in T}$ and outputs $(\pmb{o}_i)_{i \in T} \in O \subseteq \mathcal{O}$ it holds that
$$Pr[\mathcal{M}(S_T) = (\pmb{o}_1, \dots, \pmb{o}_T)] = \prod_{i \in [1, T]} Pr[\mathcal{M}_i(D_i) = \pmb{o}_i]$$ $$Pr[\mathcal{M}(S_T) = (\pmb{o}_i)_{i \in T}] = \prod_{i \in T} Pr[\mathcal{M}_i(D_i) = \pmb{o}_i]$$
Likewise, for any {\thething}-neighboring time series $S'_T$ of $S_T$ with the same outputs $(\pmb{o}_1, \dots, \pmb{o}_T) \in O \subseteq \mathcal{O}$ Likewise, for any {\thething}-neighboring time series $S'_T$ of $S_T$ with the same outputs $(\pmb{o}_i)_{i \in T} \in O \subseteq \mathcal{O}$
$$Pr[\mathcal{M}(S'_T) = (\pmb{o}_1, \dots, \pmb{o}_T)] = \prod_{i \in [1, T]} Pr[\mathcal{M}_i(D'_i) = \pmb{o}_i]$$ $$Pr[\mathcal{M}(S'_T) = (\pmb{o}_i)_{i \in T}] = \prod_{i \in T} Pr[\mathcal{M}_i(D'_i) = \pmb{o}_i]$$
Since $S_T$ and $S'_T$ are {\thething}-neighboring, there exists $i \in T$ such that $D_i = D'_i$ for a set of {\thethings} with timestamps $L$. According to Definition~\ref{def:thething-nb}, there exists $L \cup \{t\} \subseteq T$ such that $D_i = D'_i$ for every $i \in T \setminus (L \cup \{t\})$.
Thus, we get Thus, we get
$$\frac{Pr[\mathcal{M}(S_T) = (\pmb{o}_1, \dots, \pmb{o}_T)]}{Pr[\mathcal{M}(S'_T) = (\pmb{o}_1, \dots, \pmb{o}_T)]} = \prod_{i \in L \cup \{t\}} \frac{Pr[\mathcal{M}_i(D_i) = \pmb{o}_i]}{Pr[\mathcal{M}_i(D'_i) = \pmb{o}_i]}$$ $$\frac{Pr[\mathcal{M}(S_T) = (\pmb{o}_i)_{i \in T}]}{Pr[\mathcal{M}(S'_T) = (\pmb{o}_i)_{i \in T}]} = \prod_{i \in L \cup \{t\}} \frac{Pr[\mathcal{M}_i(D_i) = \pmb{o}_i]}{Pr[\mathcal{M}_i(D'_i) = \pmb{o}_i]}$$
$D_i$ and $D'_i$ are neighboring for $i \in L \cup \{t\}$. $D_i$ and $D'_i$ are neighboring for $i \in L \cup \{t\}$.
$\mathcal{M}_i$ is differential private and from Definition~\ref{def:dp} we get that $\frac{Pr[\mathcal{M}_i(D_i) = \pmb{o}_i]}{Pr[\mathcal{M}_i(D'_i) = \pmb{o}_i]} \leq e^{\varepsilon_i}$. $\mathcal{M}_i$ is differential private and from Definition~\ref{def:dp} we get that $\frac{Pr[\mathcal{M}_i(D_i) = \pmb{o}_i]}{Pr[\mathcal{M}_i(D'_i) = \pmb{o}_i]} \leq e^{\varepsilon_i}$.
Hence, we can write Hence, we can write
$$\frac{Pr[\mathcal{M}(S_T) = (\pmb{o}_1, \dots, \pmb{o}_T)]}{Pr[\mathcal{M}(S'_T) = (\pmb{o}_1, \dots, \pmb{o}_T)]} \leq \prod_{i \in L \cup \{t\}} e^{\varepsilon_i} = e^{\sum_{i \in L \cup \{t\}} \varepsilon_i}$$ $$\frac{Pr[\mathcal{M}(S_T) = (\pmb{o}_i)_{i \in T}]}{Pr[\mathcal{M}(S'_T) = (\pmb{o}_i)_{i \in T}]} \leq \prod_{i \in L \cup \{t\}} e^{\varepsilon_i} = e^{\sum_{i \in L \cup \{t\}} \varepsilon_i}$$
For any $O \in \mathcal{O}$ we get $\frac{Pr[\mathcal{M}(S_T) \in O}{Pr[\mathcal{M}(S'_T) \in O]} \leq e^{\sum_{i \in L \cup \{t\}} \varepsilon_i}$. For any $O \subseteq \mathcal{O}$ we get $\frac{Pr[\mathcal{M}(S_T) \in O]}{Pr[\mathcal{M}(S'_T) \in O]} \leq e^{\sum_{i \in L \cup \{t\}} \varepsilon_i}$.
If the formula of Theorem~\ref{theor:thething-prv} holds, then we get $\frac{Pr[\mathcal{M}(S_T) \in O}{Pr[\mathcal{M}(S'_T) \in O]} \leq e^\varepsilon$. If the formula of Theorem~\ref{theor:thething-prv} holds, then we get $\frac{Pr[\mathcal{M}(S_T) \in O]}{Pr[\mathcal{M}(S'_T) \in O]} \leq e^\varepsilon$.
Due to Definition~\ref{def:thething-prv} this concludes our proof. Due to Definition~\ref{def:thething-prv} this concludes our proof.
\end{proof} \end{proof}


@ -1,110 +1,101 @@
\subsection{Achieving {\thething} privacy} \subsection{Achieving {\thething} privacy}
\label{subsec:lmdk-sol} \label{subsec:lmdk-sol}
In this section, we propose the methodology for achieving {\thething} privacy.
\subsubsection{{\Thething} privacy mechanisms} \subsubsection{{\Thething} privacy mechanisms}
\label{subsec:lmdk-mechs} \label{subsec:lmdk-mechs}
% \kat{add the two models -- uniform and dynamic and skip}
\paragraph{Uniform} \paragraph{\texttt{Uniform}}
%\kat{isn't the uniform distribution a method? there is a section for the methods. } Figure~\ref{fig:lmdk-uniform} shows the implementation of the baseline {\thething} privacy scheme for Example~\ref{ex:st-cont} which distributes uniformly the available privacy budget $\varepsilon$.
Figure~\ref{fig:lmdk-uniform} shows the simplest model that implements Theorem~\ref{theor:thething-prv}, the \emph{Uniform} distribution of privacy budget $\varepsilon$ for {\thething} privacy. In this case, it is enough to distribute at each timestamp the total privacy budget divided by the number of timestamps corresponding to {\thethings}, plus one, i.e.,~$\frac{\varepsilon}{|L| + 1}$.
% \mk{We capitalize the first letter because it's the name of the method.}
% in comparison with user-level protection.
In this case, it is enough to distribute at each timestamp the total privacy budget divided by the number of timestamps corresponding to {\thethings}, plus one if we are releasing a regular timestamp.
Consequently, at each timestamp we protect every {\thething}, while reserving a part of $\varepsilon$ for the current timestamp. Consequently, at each timestamp we protect every {\thething}, while reserving a part of $\varepsilon$ for the current timestamp.
%In this case, distributing $\frac{\varepsilon}{5}$ can guarantee {\thething} privacy.
\begin{figure}[htp] \begin{figure}[htp]
\centering \centering
\includegraphics[width=.9\linewidth]{problem/lmdk-uniform} \includegraphics[width=.75\linewidth]{problem/lmdk-uniform}
\caption{Uniform application scenario of {\thething} privacy.} \caption{The \texttt{Uniform} application scenario of {\thething} privacy.}
\label{fig:lmdk-uniform} \label{fig:lmdk-uniform}
\end{figure} \end{figure}
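The following check is a sketch that uses only the condition of Theorem~\ref{theor:thething-prv}; it illustrates why the uniform allocation $\varepsilon_i = \frac{\varepsilon}{|L| + 1}$ suffices.
For a regular timestamp $t \notin L$ the set $L \cup \{t\}$ has $|L| + 1$ elements, whereas for a {\thething} timestamp $t \in L$ it has $|L|$ elements (the \texttt{cases} environment assumes that \texttt{amsmath} is loaded):
$$\sum_{i \in L \cup \{t\}} \varepsilon_i =
\begin{cases}
  (|L| + 1) \cdot \frac{\varepsilon}{|L| + 1} = \varepsilon, & t \notin L \\
  |L| \cdot \frac{\varepsilon}{|L| + 1} < \varepsilon, & t \in L
\end{cases}$$
In Example~\ref{ex:st-cont}, $|L| = 4$ yields $\varepsilon_i = \frac{\varepsilon}{5}$ for every timestamp.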
\paragraph{Skip} \paragraph{\texttt{Skip}}
% Why skipping publications is problematic? One might argue that we could skip the {\thething} data releases as we demonstrate in Figure~\ref{fig:lmdk-skip}, by republishing previous, regular event releases.
One might argue that we could \emph{Skip} the \thething\ data releases. This would result in preserving all of the available privacy budget for regular events, equivalently to event-level protection, i.e.,~$\varepsilon_i = \varepsilon$, $\forall i \in T \setminus L$.
% and limit the number of {\thethings}.
This would result in preserving all of the available privacy budget for regular events (because the set $L \cup \{t\}$ becomes $\{t\}$), equivalently to event-level protection.
In practice, however, this approach can eventually pose arbitrary privacy risks, especially when dealing with geotagged data.
Particularly, sporadic location data publishing~\cite{gambs2010show, russell2018fitness} or misapplying location cloaking~\cite{xssfopes2020tweet} could result in areas with sparse data points, indicating privacy-sensitive locations.
\begin{figure}[htp] \begin{figure}[htp]
\centering \centering
\includegraphics[width=.9\linewidth]{problem/lmdk-skip} \includegraphics[width=.75\linewidth]{problem/lmdk-skip}
\caption{Application scenario of the Skip model in {\thething} privacy.} \caption{Application scenario of the \texttt{Skip} {\thething} privacy scheme.}
\label{fig:lmdk-skip} \label{fig:lmdk-skip}
\end{figure} \end{figure}
In practice, however, this approach can eventually pose arbitrary privacy risks, especially when dealing with geotagged data.
Particularly, sporadic location data publishing or misapplying location cloaking could result in areas with sparse data points, indicating privacy-sensitive locations~\cite{gambs2010show, russell2018fitness}.
We study this problem and investigate possible solutions in Section~\ref{subsec:lmdk-sel-sol}.
\paragraph{Adaptive}
Next, we propose an \emph{Adaptive} privacy mechanism taking into account changes in the input data and exploiting the post-processing property of differential privacy (Figure~\ref{fig:lmdk-adaprive}). \paragraph{\texttt{Adaptive}}
Next, we propose an adaptive privacy scheme (Figure~\ref{fig:lmdk-adaprive}) that accounts for changes in the input data by exploiting the post-processing property of differential privacy (Theorem~\ref{theor:p-proc}).
\begin{figure}[htp] \begin{figure}[htp]
\centering \centering
\includegraphics[width=.5\linewidth]{problem/lmdk-adaptive} \includegraphics[width=.75\linewidth]{problem/lmdk-adaptive}
\caption{Adaptive application scenario of {\thething} privacy.} \caption{Concept of \texttt{Adaptive} {\thething} privacy.}
\label{fig:lmdk-adaprive} \label{fig:lmdk-adaprive}
\end{figure} \end{figure}
Initially, its budget management component reserves uniformly the available privacy budget for each future release. Initially, its budget management component reserves uniformly the available privacy budget $\varepsilon$ for each future release $\mathbf{o}$.
At each timestamp, it performs an analysis on the released data and based on that it adjusts the sampling rate of its processing component. At each timestamp, the processing component decides to either sample from the time series the current input and publish it with noise or release an approximation based on previous releases.
At each timestamp, the processing component decides to either publish with noise the original data or it releases an approximation based on previous releases. In the case when it publishes with noise the original data, the analysis component estimates the data trends by calculating the difference between the current and the previous releases and compares the difference with the scale of the perturbation, i.e.,~$\frac{\Delta f}{\varepsilon}$~\cite{kellaris2014differentially}.
In the case when it publishes with noise the original data, the analysis component estimates the data trends by calculating the difference between the current and the previous release and compares the difference with the scale of the perturbation ($\frac{\Delta f}{\varepsilon}$).
The outcome of this comparison determines the adaptation of the sampling rate of the processing component for the next events: The outcome of this comparison determines the adaptation of the sampling rate of the processing component for the next events:
if the scale is greater it means that the data trends are evolving, and therefore it must decrease the sampling rate. if the difference is greater than the scale, the data trends are evolving, and therefore it must increase the sampling rate.
In the case when the mechanism approximates a {\thething} (but not a regular timestamp), the budget management component distributes the reserved privacy budget In the case when the mechanism approximates a {\thething} (but not a regular timestamp), the budget management component distributes the reserved privacy budget to the next timestamps.
% divided by the number of remaining {\thething} plus one Due to the post-processing property of differential privacy (Theorem~\ref{theor:p-proc}), the analysis component does not consume any privacy budget, allowing for better final data utility.
to the next timestamps.
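To illustrate the comparison above with concrete numbers (all values are hypothetical and chosen only for exposition), assume a count query with sensitivity $\Delta f = 1$ and a privacy budget of $\varepsilon = 0.5$ for the current release, so that the scale of the perturbation is
\[
	\frac{\Delta f}{\varepsilon} = \frac{1}{0.5} = 2.
\]
If the previous and the current noisy releases are $10$ and $13$, their difference is $|13 - 10| = 3 > 2$; the change exceeds the scale of the perturbation, so the data trends are considered to be evolving and the sampling rate increases.
If the current release were $11$ instead, the difference $|11 - 10| = 1 \leq 2$ would lie within the scale of the perturbation and would not trigger such an increase.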
\subsubsection{{\Thething} privacy under temporal correlation} \subsubsection{{\Thething} privacy under temporal correlation}
\label{subsec:lmdk-tpl} \label{subsec:lmdk-tpl}
From the discussion so far, it is evident that for the budget distribution it is not the positions, but rather the number of the {\thethings} that matters.
However, this is not the case under the presence of temporal correlation.
From the discussion so far, it is evident that for the budget distribution it is not the positions but rather the number of the {\thethings} that matters. The Hidden Markov Model scheme (as used in~\cite{cao2018quantifying}) stipulates two important independence properties: (i)~the future (or past) depends on the past (or future) via the present, and (ii)~the current observation is independent of the rest given the current state.
However, this is not the case under the presence of temporal correlation, which is inherent in continuously generated data.
% HMMs have two important independence properties:
% Markov hidden process: future depends on past via the present.
% Current observation independent of all else given current state.
% Intuitively, D^t or D^{t+1} "cuts off" the propagation of the Markov chain.
The Hidden Markov Model~\cite{baum1966statistical} stipulates two important independence properties: (i)~the future(past) depends on the past(future) via the present, and (ii)~the current observation is independent of the rest given the current state.
%Thus, the observation of a data release at a timestamp $t$ depends only on the respective input data set $D_t$, i.e.,~the current state.
Hence, there is independence between an observation at a specific timestamp and previous/next data sets under the presence of the current input data set. Hence, there is independence between an observation at a specific timestamp and previous/next data sets under the presence of the current input data set.
Intuitively, knowing the data set at timestamp $t$ stops the propagation of the Markov chain towards the next or previous timestamps. Intuitively, knowing the data set at timestamp $t$ stops the propagation of the Markov chain towards the next or previous timestamps in the time series.
%\kat{do we see this in the formula 1 ?}
%when calculating the forward or backward privacy loss respectively.
Cao et al.~\cite{cao2017quantifying} propose a method for computing the temporal privacy loss $\alpha_t$ at a timestamp $t$ as the sum of the backward and forward privacy loss, $\alpha^B_t$ and $\alpha^F_t$, minus the privacy budget $\varepsilon_t$ In Section~\ref{subsec:compo} we showed that the temporal privacy loss $\alpha_t$ at a timestamp $t$ is calculated as the sum of the backward and forward privacy loss, $\alpha^B_t$ and $\alpha^F_t$, minus the privacy budget $\varepsilon_t$, to account for the extra privacy loss due to previous and next releases $\pmb{o}$ of $\mathcal{M}$ under temporal correlation.
to account for the extra privacy loss due to previous and next releases $\pmb{o}$ of $\mathcal{M}$ under temporal correlation.
By Theorem~\ref{theor:thething-prv}, at every timestamp $t$ we consider the data at $t$ and at the {\thething} timestamps $L$. By Theorem~\ref{theor:thething-prv}, at every timestamp $t$ we consider the data at $t$ and at the {\thething} timestamps $L$.
%According to the Definitions~{\ref{def:bpl} and \ref{def:fpl}}, we calculate the backward and forward privacy loss by taking into account the privacy budget at previous and next data releases respectively. When sequentially composing the data releases for each timestamp $i$ in $L \cup \{t\}$, we consider the previous releases in the whole time series until the timestamp $i^{-}$ that is exactly before $i$ in the ordered $L \cup \{t\}$, and the next data releases in the whole time series until the timestamp $i^{+}$ that is exactly after $i$ in the ordered $L \cup \{t\}$.
When sequentially composing the data releases for each timestamp $i$ in $L \cup \{t\}$ we Figure~\ref{fig:lmdk-tpl} illustrates $i^{-}$ and $i^{+}$ in Example~\ref{ex:scenario}.
%calculate the temporal privacy loss $\alpha_t$ at each timestamp $t \in L \cup \{i\}$ by
%consider the previous and next data releases at the timestamps $i^{-}, i^{+} \in L \cup \{t\} \setminus \{i\}$ respectively.
consider the previous releases in the whole time series until the timestamp $i^{-}$ that is exactly before $i$ in the ordered $L {\cup} \{t\}$, and the next data releases in the whole time series until the timestamp $ i^{+}$ that is exactly after $i$ in the ordered $L {\cup }\{t\} $.
%\kat{not sure I understand}
%Thus, we calculate the backward/forward privacy loss by taking into account the data releases after/before the previous/next data item.
That is:
% \dk{do we keep looking at all Landmarks both for backward and forward? I would assume that for backward we are looking to the Landmarks until the i and for the forward to the Landmarks after the i - if we would like to be consistent with Cao. Otherwise the writing here is confusing.}
% \mk{We are discussing about the case where we calculate the tpl at each timestamp i in L+{t}. Therefore, bpl at i is calculated until i- and fpl at i until i+.}
\begin{align} \begin{figure}[htp]
\adjustbox{max width=0.9\linewidth}{ \centering
$\alpha_i = \includegraphics[width=.75\linewidth]{problem/lmdk-tpl}
\underbrace{\ln \frac{\Pr[\pmb{o}_{i^{-} + 1}, \dots, \pmb{o}_i | D_i]}{\Pr[\pmb{o}_{i^{-} + 1}, \dots, \pmb{o}_i | D'_i]}}_{\alpha^B_i} + \caption{The timestamps exactly before ($-$) and after ($+$) every timestamp, where that is applicable, for the calculation of the temporal privacy loss.}
\underbrace{\ln \frac{\Pr[\pmb{o}_{i^{+} - 1}, \dots, \pmb{o}_i | D_i]}{\Pr[\pmb{o}_{i^{+} - 1}, \dots, \pmb{o}_i | D'_i]}}_{\alpha^F_i} - \label{fig:lmdk-tpl}
\underbrace{\ln \frac{\Pr[\pmb{o}_i | D_i]}{\Pr[\pmb{o}_i | D'_i]}}_{\varepsilon_i}$ \end{figure}
}
\end{align}
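As a small illustration of $i^{-}$ and $i^{+}$, consider a hypothetical instance (the values are assumptions chosen for exposition and are not taken from Example~\ref{ex:scenario}): $T = \{1, \dots, 8\}$, $L = \{2, 5, 7\}$, and $t = 4$, so that the ordered $L \cup \{t\}$ is $(2, 4, 5, 7)$.
We further assume the boundary convention $i^{-} = 0$ for the first and $i^{+} = \max(T) + 1$ for the last element.
The releases that enter the backward window $[i^{-} + 1, i]$ and the forward window $[i, i^{+} - 1]$ are then:
\begin{align*}
	i = 2 &: \quad i^{-} = 0,\ i^{+} = 4, & \text{backward: } & \pmb{o}_1, \pmb{o}_2, & \text{forward: } & \pmb{o}_2, \pmb{o}_3 \\
	i = 4 &: \quad i^{-} = 2,\ i^{+} = 5, & \text{backward: } & \pmb{o}_3, \pmb{o}_4, & \text{forward: } & \pmb{o}_4 \\
	i = 5 &: \quad i^{-} = 4,\ i^{+} = 7, & \text{backward: } & \pmb{o}_5, & \text{forward: } & \pmb{o}_5, \pmb{o}_6 \\
	i = 7 &: \quad i^{-} = 5,\ i^{+} = 9, & \text{backward: } & \pmb{o}_6, \pmb{o}_7, & \text{forward: } & \pmb{o}_7, \pmb{o}_8
\end{align*}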
Finally, $\alpha_t$ is equal to the sum of all $\alpha_i , i\in L \cup\{t\}$. Therefore, in Definition~\ref{def:lmdk-tpl}, we formulate the {\thething} temporal privacy loss as follows.
% \begin{definition}
% where $x_t$ (or $x'_t$) is the potential (neighboring) data item of an individual who is targeted by an adversary with knowledge $\mathbb{D}_t$. [{\Thething} temporal privacy loss]
%where $D_t$ and $D'_t$ are the neighboring input data sets (Definition~\ref{def:nb-d-s}) responsible for the output $\pmb{o}_t$. \label{def:lmdk-tpl}
%Notice that if $t$ is the first or last item in $L \cup \{i\}$ then we need to set $t_{\text{prv}} = 0$ or $t_{\text{nxt}} = \max(T) + 1$. Given a {\thething} set $L$ in a set of timestamps $T$, the potential overall temporal privacy loss of a privacy mechanism $\mathcal{M}$ at any timestamp in $L \cup \{t\}$ is
$$\sum_{i \in L \cup \{t\}} \alpha_i$$
%In Section~\ref{sec:eval}, we experimentally show how the distribution of {\thethings} impacts the overall privacy loss of the user. where, with $i^{-}, i^{+} \in L \cup \{t\}$ being the timestamps exactly before and after $i$, $\alpha_i$ is equal to
\begin{align}
\label{eq:lmdk-tpl}
\adjustbox{max width=0.9\linewidth}{
$\underbrace{\ln \frac{\Pr[(\pmb{o}_j)_{j \in [i^{-} + 1, i]} | D_i]}{\Pr[(\pmb{o}_j)_{j \in [i^{-} + 1, i]} | D'_i]}}_{\alpha^B_i} +
\underbrace{\ln \frac{\Pr[(\pmb{o}_j)_{j \in [i, i^{+} - 1]} | D_i]}{\Pr[(\pmb{o}_j)_{j \in [i, i^{+} - 1]} | D'_i]}}_{\alpha^F_i} -
\underbrace{\ln \frac{\Pr[\pmb{o}_i | D_i]}{\Pr[\pmb{o}_i | D'_i]}}_{\varepsilon_i}$
}
\end{align}
\end{definition}
As presented in~\cite{cao2018quantifying}, the temporal privacy loss of a time series (without {\thethings}) can be bounded by a given privacy budget $\varepsilon$.
Intuitively, by Equation~\ref{eq:lmdk-tpl}, the temporal privacy loss incurred when considering {\thethings} is less than the temporal privacy loss in the case without the knowledge of the {\thethings}.
Thus, the temporal privacy loss in {\thething} privacy can also be bounded by $\varepsilon$.
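As a sanity check of Equation~\ref{eq:lmdk-tpl}, consider the special case without temporal correlation; this is an assumption made here purely for illustration.
In that case, the releases at timestamps other than $i$ carry no information about $D_i$, so they contribute identical factors to the numerator and the denominator of the backward and forward terms, and both terms reduce to the ratio for $\pmb{o}_i$ alone:
\begin{align*}
	\alpha^B_i = \alpha^F_i = \ln \frac{\Pr[\pmb{o}_i | D_i]}{\Pr[\pmb{o}_i | D'_i]} = \varepsilon_i
	\quad \Longrightarrow \quad
	\alpha_i = \varepsilon_i + \varepsilon_i - \varepsilon_i = \varepsilon_i .
\end{align*}
The overall loss then becomes $\sum_{i \in L \cup \{t\}} \varepsilon_i$, i.e.,~the sum of the budgets spent at $t$ and at the {\thething} timestamps.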