theotherthing: Intro

Manos Katsomallos 2021-10-10 19:48:33 +02:00
parent 20e73274b3
commit 48c797922a


@@ -22,178 +22,36 @@
\section{Selection of events}
\label{sec:theotherthing}
% New version (added in this commit)
In Section~\ref{sec:thething}, we introduced the notion of {\thething} events in privacy-preserving time series publishing.
The differentiation between regular and {\thething} events stipulates a privacy budget allocation that deviates from the application of existing differential privacy protection levels.
Based on this novel event categorization, we designed three models (Section~\ref{subsec:lmdk-mechs}) that achieve {\thething} privacy.
For this, we assumed that the timestamps in the {\thething} set $L$ are not privacy sensitive, and therefore we used them in our models as they were.
This may pose a direct or indirect privacy threat to the users.
For the former, we consider the case where we wish to publish $L$ as complementary information to the release of the event values.
For the latter, the privacy budget is usually an inseparable attribute of the data release, which not only quantifies the privacy guarantee to the data generators (users) but also gives an estimate of the data utility to the data consumers (analysts).
In Example~\ref{ex:lmdk-risk}, we demonstrate the extreme case of the application of the Skip {\thething} privacy model from Figure~\ref{fig:lmdk-skip}, where we approximate {\thethings} and invest all of the available privacy budget in regular events, i.e.,~$\varepsilon_i = 0, \forall i \in L$.
\begin{example}
\label{ex:lmdk-risk}
Figure~\ref{fig:lmdk-risk} shows the privacy risks that the application of a {\thething} privacy model that nullifies or approximates outputs, similar to Skip, might cause.
We point out (in light red shade) the details that might cause indirect information inference.
In this extreme case, the minimization of the privacy budget in combination with nullifying the output (either by not publishing or by adding a lot of noise) or approximating the current output with previously released outputs might hint to any adversary that the current event is a {\thething}.
\begin{figure}[htp]
\centering
\includegraphics[width=\linewidth]{problem/lmdk-risk}
\caption{The privacy risks (in light red shade) that the application of the {\thething} privacy Skip model might pose.}
\label{fig:lmdk-risk}
\end{figure}
Apart from the privacy budget that we invested at {\thethings}, we can also observe a pattern for the budgets at regular events.
Therefore, an adversary who observes the values of the privacy budget can easily infer not only the number but also the exact temporal position of {\thethings}.
\end{example}
\input{problem/theotherthing/contribution}
\input{problem/theotherthing/problem}
\input{problem/theotherthing/solution}
% Old version (removed in this commit)
Given a set of {\thethings} at respective timestamps $\{l_k\}$ in a series of events at $\{t_n\}$, such that $\{l_k\} \subseteq \{t_n\}$, a data publisher might release this information by:
\begin{enumerate}
\item Selecting a set of options (Section~\ref{subsec:lmdk-set-opts}) consisting of different possible versions of $\{l_k\}$.
\mk{`option' or `candidate'?}
This could be:
\begin{itemize}
\item either a random set of $k$ other timestamps similar to the actual {\thething} timestamps (Section~\ref{subsec:lmdk-rnd}),
\item or a set including $\{l_k\}$ and $x \in [1, n - k]$ additional dummy timestamps (Section~\ref{subsec:lmdk-dum-gen}).
\end{itemize}
\item Releasing a privacy-preserving version of the {\thething} timestamps (Section~\ref{subsec:priv-opt-sel}).
We utilize the exponential mechanism with a utility function that calculates an indicator for each of the options in the set that we selected in the previous step.
The utility depends on the positioning of the {\thething} timestamps of an option in the series, e.g.,~the distance from the previous/next {\thething}, the distance from the start/end of the series, etc.
\end{enumerate}
Following this process allows the release, and thereafter processing, of {\thething} timestamps.
Thus, we provide an extra layer of privacy protection when we separate {\thethings} from regular events.
\subsubsection{{\Thething} set options}
\label{subsec:lmdk-set-opts}
This step aims to select a set of candidate {\thething} timestamp options either by randomizing the actual timestamps (Section~\ref{subsec:lmdk-rnd}), or by adding dummy timestamps (Section~\ref{subsec:lmdk-dum-gen}) to the actual {\thething} timestamps.
\paragraph{{\Thething} randomization}
\label{subsec:lmdk-rnd}
A simple way to select a set of timestamps without disclosing the actual {\thethings} is by \emph{randomly} selecting an equally sized set of timestamps.
The randomization of the process, as we will discuss in more detail in Section~\ref{subsec:priv-opt-sel}, will depend on the positioning of the {\thethings} in the series of events.
In more detail, given a set of {\thething} timestamps $\{l_k\} \subseteq \{t_n\}$, where $\{t_n\}$ is an event sequence, we need to select all possible sets of size $k$ from $\{t_n\}$.
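To give a sense of the size of this search space, a series of $n$ timestamps admits
\[
  \frac{n!}{k!\,(n - k)!}
\]
distinct sets of size $k$.
For instance, with the purely illustrative values $n = 10$ and $k = 3$, there are $\frac{10!}{3!\,7!} = 120$ candidate sets, one of which is the actual set $\{l_k\}$.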
However, the introduction of randomization could arbitrarily impact the effectiveness of non-uniform privacy-protection methods.
This applies mainly in cases where we try to achieve optimal privacy-protection of {\thething} events while maximizing the utility of the data that corresponds to the rest of the series of events.
As a consequence, it is possible to end up providing a lower level of protection to {\thething} data than necessary, i.e.,~worse than the users' privacy-protection expectations.
The methodology that we present next (Section~\ref{subsec:lmdk-dum-gen}) attempts to tackle the aforementioned shortcoming.
\paragraph{Dummy {\thething} generation}
\label{subsec:lmdk-dum-gen}
Selecting extra events, on top of the actual {\thethings}, as dummy {\thethings} can render actual ones indistinguishable.
The goal is to select a list of sets with additional timestamps from a series of events at timestamps $\{t_n\}$ for a set of {\thethings} at $\{l_k\} \subseteq \{t_n\}$.
Algorithms~\ref{algo:lmdk-sel-opt} and \ref{algo:lmdk-sel-heur} approach this problem with an optimal and heuristic methodology, respectively.
Function \calcMetric measures an indicator for the union of $\{l_k\}$ and a timestamp combination from $\{t_n\} \setminus \{l_k\}$.
Function \evalSeq evaluates the result of \calcMetric by, e.g.,~estimating the standard deviation of all the distances from the previous/next {\thething}.
Function \getOpts returns all possible \emph{valid} options \opt, i.e.,~sequences of sets such that $\{l_{k+i}\} \subset \{l_{k+j}\}, \forall i, j \in [1, n - k] \mid i < j$; in other words, larger sets must contain all of the timestamps that are present in smaller ones.
Each option contains sets of sizes $k + 1, k + 2, \dots, n$, where each set is the union of $\{l_k\}$ with $x \in [1, n - k]$ timestamps from $\{t_n\} \setminus \{l_k\}$.
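As a small worked illustration, assume that \calcMetric returns the distances between consecutive {\thethings} and that \evalSeq computes their population standard deviation $\sigma$; both choices, as well as the values below, are assumptions made only for this example.
Let $\{t_n\} = \{1, \dots, 10\}$ and $\{l_k\} = \{2, 5, 9\}$.
The original distances are $(3, 4)$, with mean $3.5$ and
\[
  \sigma_{\text{orig}} = \sqrt{\frac{(3 - 3.5)^2 + (4 - 3.5)^2}{2}} = 0.5 .
\]
Adding the dummy timestamp $7$ gives the distances $(3, 2, 2)$ with $\sigma = \sqrt{2}/3 \approx 0.47$, whereas adding $3$ gives $(1, 2, 4)$ with $\sigma = \sqrt{14}/3 \approx 1.25$.
The respective differences from $\sigma_{\text{orig}}$ are approximately $0.03$ and $0.75$; hence, under these assumptions, timestamp $7$ is the preferable dummy {\thething}, since it preserves the spacing pattern of the original.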
\begin{algorithm}
\caption{Optimal dummy {\thething} set options selection}
\label{algo:lmdk-sel-opt}
\DontPrintSemicolon
\SetKwInput{KwData}{Input}
\KwData{$\{t_n\}, \{l_k\}$}
\KwResult{\optim}
\BlankLine
% Evaluate the original
\metricOrig $\leftarrow$ \calcMetric{$\{t_n\}, \emptyset, \{l_k\}$}\;
\evalOrig $\leftarrow$ \evalSeq{\metricOrig}\;
% Get all possible option combinations
\opts $\leftarrow$ \getOpts{$\{t_n\}, \{l_k\}$}\;
% Track the minimum (best) evaluation
\diffMin $\leftarrow$ $\infty$\;
% Track the optimal sequence (the one with the best evaluation)
\optim $\leftarrow$ $[]$\;
\ForEach{\opt $\in$ \opts}{\label{algo:lmdk-sel-opt-for-each}
\evalSum $\leftarrow 0$\;
\ForEach{\opti $\in$ \opt}{
\metricCur $\leftarrow$ \calcMetric{$\{t_n\}, \opti, \{l_k\}$}\;\label{algo:lmdk-sel-opt-comparison}
\evalSum $\leftarrow$ \evalSum $+$ \evalSeq{\metricCur}\;
}
% Compare the mean evaluation of the option with the current optimal
\diffCur $\leftarrow \left|\evalSum/\#\opt - \evalOrig\right|$\;
\If{\diffCur $<$ \diffMin}{
\diffMin $\leftarrow$ \diffCur\;
\optim $\leftarrow$ \opt\;
}
}\label{algo:lmdk-sel-opt-end}
\Return{\optim}
\end{algorithm}
Algorithm~\ref{algo:lmdk-sel-opt} evaluates each option in \opts (Lines~{\ref{algo:lmdk-sel-opt-for-each}--\ref{algo:lmdk-sel-opt-end}}).
It selects the option that is the most \emph{similar} to the original (Lines~{\ref{algo:lmdk-sel-opt-comparison}--\ref{algo:lmdk-sel-opt-end}}), i.e.,~the option whose evaluation differs the least from that of the sequence $\{t_n\}$ with {\thethings} $\{l_k\}$.
\begin{algorithm}
\caption{Heuristic dummy {\thething} set options selection}
\label{algo:lmdk-sel-heur}
\DontPrintSemicolon
\KwData{$\{t_n\}, \{l_k\}$}
\KwResult{\optim}
\BlankLine
% Evaluate the original
\metricOrig $\leftarrow$ \calcMetric{$\{t_n\}, \emptyset, \{l_k\}$}\;
\evalOrig $\leftarrow$ \evalSeq{\metricOrig}\;
% Initialize the list of selected options
\optim $\leftarrow$ $[]$\;
$\{l_{k'}\} \leftarrow \{l_k\}$\;
\While{$\{l_{k'}\} \neq \{t_n\}$}{\label{algo:lmdk-sel-heur-while}
% Track the minimum (best) evaluation
\diffMin $\leftarrow$ $\infty$\;
\optimi $\leftarrow$ $0$\;
% Find the combinations for one more point
\ForEach{\reg $\in \{t_n\} \setminus \{l_{k'}\}$}{
% Evaluate current
\metricCur $\leftarrow$ \calcMetric{$\{t_n\}, \reg, \{l_{k'}\}$}\;\label{algo:lmdk-sel-heur-comparison}
\evalCur $\leftarrow$ \evalSeq{\metricCur}\;
% Compare evaluations
\diffCur $\leftarrow$ $\left|\evalCur - \evalOrig\right|$\;
\If{\diffCur $<$ \diffMin}{
\diffMin $\leftarrow$ \diffCur\;
\optimi $\leftarrow$ \reg\;
}\label{algo:lmdk-sel-heur-comparison-end}
}
% Save new point to landmarks
$k' \leftarrow k' + 1$\;
$l_{k'} \leftarrow \optimi$\;
% Add new option
\optim.add($\{l_{k'}\} \setminus \{l_k\}$)\;
}\label{algo:lmdk-sel-heur-end}
\Return{\optim}
\end{algorithm}
Algorithm~\ref{algo:lmdk-sel-heur} follows an incremental methodology.
At each step, it selects a new timestamp that corresponds to a regular ({non-\thething}) event from $\{t_n\} \setminus \{l_{k'}\}$, i.e.,~one that it has not selected yet.
Similar to Algorithm~\ref{algo:lmdk-sel-opt}, the selection is done based on a predefined metric (Lines~{\ref{algo:lmdk-sel-heur-comparison}-\ref{algo:lmdk-sel-heur-comparison-end}}).
This process (Lines~{\ref{algo:lmdk-sel-heur-while}-\ref{algo:lmdk-sel-heur-end}}) goes on until the selected set covers the entire series of events, i.e.,~$\{l_{k'}\} = \{t_n\}$.
Note that the reverse heuristic approach, i.e.,~starting with all of $\{t_n\}$ as {\thethings} and removing timestamps until only $\{l_k\}$ remains, performs worse than, or occasionally the same as, Algorithm~\ref{algo:lmdk-sel-heur}.
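The difference in cost between the two approaches can be estimated as follows, assuming that every valid option adds exactly one timestamp per step and measuring cost in calls to \calcMetric; this is a rough sketch rather than a formal complexity analysis.
Let $m = n - k$ be the number of regular events.
Every ordering of the $m$ regular timestamps yields a distinct valid option, and each option contains $m$ sets, so the optimal approach performs
\[
  m \cdot m!
\]
evaluations, whereas the heuristic evaluates $m, m - 1, \dots, 1$ candidate timestamps in its successive iterations, i.e.,
\[
  \sum_{i = 1}^{m} i = \frac{m (m + 1)}{2}
\]
evaluations in total.
For the illustrative value $m = 5$, this amounts to $5 \cdot 120 = 600$ evaluations for the optimal approach compared to $15$ for the heuristic.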
\subsubsection{Privacy-preserving option selection}
\label{subsec:priv-opt-sel}
% Nearby events
Events that occur at recent timestamps are more likely to reveal sensitive information regarding the users involved~\cite{kellaris2014differentially}.
Thus, taking into account more recent events with respect to {\thethings} can result in less privacy loss and better privacy protection overall.
However, this leads to worse data utility.
% Depending on the {\thething} discovery technique
The values of events near a {\thething} are usually similar to the value of that {\thething}.
Therefore, privacy-preserving mechanisms are likely to approximate their values based on the nearest {\thething} instead of investing extra privacy budget to perturb their actual values, thus spending less privacy budget.
Saving privacy budget for releasing perturbed versions of actual event values can bring about better data utility.
% Distant events
However, indicating the existence of randomized/dummy {\thethings} near actual {\thethings} can increase the adversarial confidence regarding the location of the latter within a series of events.
Hence, choosing randomized/dummy {\thethings} far from the actual {\thethings} (and thus less relevant) can limit the final privacy loss.
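For reference, a selection of this kind can be expressed with the standard form of the exponential mechanism; the symbols $O$ (the set of candidate options), $u$ (the utility function), and $\Delta u$ (its sensitivity) are notation assumed here for illustration.
Given a privacy budget $\varepsilon$, the mechanism outputs an option $o \in O$ with probability
\[
  \Pr[o] = \frac{\exp\left(\frac{\varepsilon \, u(o)}{2 \Delta u}\right)}{\sum_{o' \in O} \exp\left(\frac{\varepsilon \, u(o')}{2 \Delta u}\right)} .
\]
Under this formulation, a utility function that scores higher the options whose randomized/dummy {\thethings} lie farther from the actual ones makes such options exponentially more likely to be released, while every option in $O$ retains a nonzero probability.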