\subsection{Protecting {\thethings}}
\label{subsec:lmdk-sel-sol}
The main idea of the privacy-preserving dummy {\thething} selection module is to privately select extra {\thething} event timestamps, i.e.,~dummy {\thethings}, from the set of timestamps $T \setminus L$ of the time series $S_T$ and add them to the original {\thething} set $L$.
Selecting extra events, on top of the actual {\thethings}, as dummy {\thethings}, can render the actual ones indistinguishable.
The goal is to create a new set $L'$ such that $L \subset L' \subseteq T$.
First, we generate a set of dummy {\thething} set options by adding regular event timestamps from $T \setminus L$ to $L$ (Section~\ref{subsec:lmdk-set-opts}).
Then, we utilize the exponential mechanism, with a utility function that calculates an indicator for each of the options in the set, based on how much it differs from the original {\thething} set $L$, and randomly select one of the options (Section~\ref{subsec:lmdk-opt-sel}).
This process provides an extra layer of privacy protection to {\thethings}, and thus allows the processing, and thereafter releasing, of {\thething} timestamps.
\subsubsection{Dummy {\thething} selection}
\label{subsec:lmdk-set-opts}
Algorithms~\ref{algo:lmdk-sel-opt} and \ref{algo:lmdk-sel-heur} approach this problem with an optimal and heuristic methodology, respectively.
Function \evalSeq evaluates the result of the union of $L$ and a timestamp combination from $T \setminus L$ by, e.g.,~estimating the standard deviation of all the distances from the previous/next {\thething}.
\getOpts returns all the possible \emph{valid} sets of combinations \opt such that larger options contain all of the timestamps that are present in smaller ones.
Each option contains sets of timestamps with sizes $\left|L\right| + 1, \left|L\right| + 2, \dots, \left|T\right|$, where each set is the union of $L$ with $x \in [1, \left|T\right| - \left|L\right|]$ timestamps from $T \setminus L$.
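As a toy illustration (the series below is hypothetical), let $T = \{t_1, \dots, t_5\}$ and $L = \{t_2, t_4\}$, so that $T \setminus L = \{t_1, t_3, t_5\}$.
One valid option that \getOpts may return is the nested chain
\[
  \{t_2, t_3, t_4\} \subset \{t_1, t_2, t_3, t_4\} \subset \{t_1, t_2, t_3, t_4, t_5\} = T,
\]
with sizes $|L| + 1 = 3$, $|L| + 2 = 4$, and $|T| = 5$.
In contrast, the pair $\{t_2, t_3, t_4\}$, $\{t_1, t_2, t_4, t_5\}$ cannot belong to the same option, because the larger set does not contain $t_3$.
Since every valid chain corresponds to an ordering of the three regular timestamps, this toy series admits $3! = 6$ options.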
\paragraph{\texttt{Optimal}}
The \texttt{Optimal} algorithm (Algorithm~\ref{algo:lmdk-sel-opt}) generates every possible combination (option) of {\thething} sets $L'$ containing one set of every possible size, i.e.,~$|L| + 1, |L| + 2, \dots, |T|$.
Each $L'$ contains the original {\thethings} along with timestamps of regular events from $T \setminus L$ (dummy {\thethings}).
Then, it evaluates each option by comparing each of its sets with the original {\thething} set $L$ and estimating an overall similarity score for each option (Lines~{\ref{algo:lmdk-sel-opt-for-each}--\ref{algo:lmdk-sel-opt-end}}).
We discuss possible utility score functions later on in Section~\ref{subsec:lmdk-opt-sel}.
It finds the option that is the most \emph{similar} to the original (Lines~{\ref{algo:lmdk-sel-opt-comparison}--\ref{algo:lmdk-sel-opt-end}}), i.e.,~the option whose evaluation differs the least from that of the sequence $T$ with {\thethings} $L$.
The goal of this process is to select the option that contains the combination of dummy {\thething} sets that achieves the best score.
\begin{algorithm}
  \caption{\texttt{Optimal} dummy {\thething} set options generation}
  \label{algo:lmdk-sel-opt}
  \DontPrintSemicolon
  \KwData{the time series timestamps $T$, the {\thething} set $L$}
  \KwResult{the selected {\thething} set options \opts}
  \BlankLine
  % Evaluate the original
  \evalOrig $\leftarrow$ \evalSeq{$T, \emptyset, L$}\;
  % Track the minimum (best) evaluation
  \diffMin $\leftarrow$ $\infty$\;
  % Track the optimal sequence (the one with the best evaluation)
  \opts $\leftarrow$ $[]$\;
  \ForEach{\opt $\in$ \getOpts{$T, L$}}{ \label{algo:lmdk-sel-opt-for-each}
    \evalCur $\leftarrow 0$\;
    \ForEach{\opti $\in$ \opt}{
      \evalCur $\leftarrow$ \evalCur $+$ \evalSeq{$T, \opti, L$}/\#\opt\; \label{algo:lmdk-sel-opt-comparison}
    }
    % Compare with current optimal
    \diffCur $\leftarrow \left|\evalCur - \evalOrig\right|$\;
    \If{\diffCur $<$ \diffMin}{
      \diffMin $\leftarrow$ \diffCur\;
      \opts $\leftarrow$ \opt\;
    }
  } \label{algo:lmdk-sel-opt-end}
  \Return{\opts}
\end{algorithm}
Algorithm~\ref{algo:lmdk-sel-opt} is guaranteed to return the optimal option with regard to the original set $L$.
However, it is rather costly in terms of complexity.
In more detail, given $|T \setminus L|$ regular events and a combination of size $r$, it requires $O(C(|T \setminus L|, r) + 2^{C(|T \setminus L|, r)})$ time and $O(r \cdot C(|T \setminus L|, r))$ space.
Next, we present a \texttt{Heuristic} solution with improved time and space requirements.
\paragraph{\texttt{Heuristic}}
The \texttt{Heuristic} algorithm (Algorithm~\ref{algo:lmdk-sel-heur}) follows an incremental methodology and at each step it selects a new timestamp, corresponding to a regular event from $T \setminus L'$.
In this case, the set $L'$ at each step differs by one element from the set that the algorithm selected in the previous step.
Similar to \texttt{Optimal}, it selects each new set based on a predefined similarity metric, and stops once the selected set covers the whole series of events, i.e.,~$L' = T$.
\begin{algorithm}
  \caption{\texttt{Heuristic} dummy {\thething} set options generation}
  \label{algo:lmdk-sel-heur}
  \DontPrintSemicolon
  \KwData{the time series timestamps $T$, the {\thething} set $L$}
  \KwResult{the selected {\thething} set options \opts}
  \BlankLine
  % Evaluate the original
  \evalOrig $\leftarrow$ \evalSeq{$T, \emptyset, L$}\;
  % Get all possible option combinations
  \opts $\leftarrow$ $[]$\;
  $L' \leftarrow L$\;
  \While{$L' \neq T$}{\label{algo:lmdk-sel-heur-while}
    % Track the minimum (best) evaluation
    \diffMin $\leftarrow$ $\infty$\;
    \optimi $\leftarrow$ Null\;
    % Find the combinations for one more point
    \ForEach{\reg $\in T \setminus L'$}{
      % Evaluate current
      \evalCur $\leftarrow$ \evalSeq{$T, \reg, L'$}\; \label{algo:lmdk-sel-heur-comparison}
      % Compare evaluations
      \diffCur $\leftarrow$ $\left|\evalCur - \evalOrig\right|$\;
      \If{\diffCur $<$ \diffMin}{
        \diffMin $\leftarrow$ \diffCur\;
        \optimi $\leftarrow$ \reg\;
      }\label{algo:lmdk-sel-heur-cmp-end}
    }
    % Save new point to landmarks
    $L'$.add(\optimi)\;
    % Add new option
    \opts.append($L' \setminus L$)\;
  }\label{algo:lmdk-sel-heur-end}
  \Return{\opts}
\end{algorithm}
Similar to Algorithm~\ref{algo:lmdk-sel-opt}, Algorithm~\ref{algo:lmdk-sel-heur} selects new options based on a predefined metric (Lines~{\ref{algo:lmdk-sel-heur-comparison}--\ref{algo:lmdk-sel-heur-cmp-end}}).
This process (Lines~{\ref{algo:lmdk-sel-heur-while}--\ref{algo:lmdk-sel-heur-end}}) goes on until the selected set covers the whole series of events, i.e.,~$L' = T$.
In terms of complexity, given $|T \setminus L|$ regular events, the \texttt{Heuristic} requires $O(|T \setminus L|^2)$ time and space.
Note that the reverse process, i.e.,~starting with $T$ {\thethings} and removing until $|L'| = |L| + 1$, performs similarly.
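To put the two bounds side by side, consider an illustrative instance (the numbers are chosen for exposition and do not come from an experiment) with $|T \setminus L| = 10$ regular events and combinations of size $r = 3$:
\[
  C(10, 3) = \frac{10!}{3!\,7!} = 120,
  \qquad
  2^{C(10, 3)} = 2^{120} \approx 1.3 \times 10^{36},
  \qquad
  |T \setminus L|^2 = 100.
\]
The exponential term dominates the bound of \texttt{Optimal}, whereas the bound of \texttt{Heuristic} stays at the order of a hundred evaluations for the same input.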
\paragraph{\texttt{Partitioned}}
We improve the complexity of the \texttt{Heuristic} algorithm by partitioning the {\thething} timestamp sequence $L$.
The novelty of this algorithm lies in the fact that it deals with the event series as a histogram which allows it to take advantage of its relevant features and methodology.
Particularly, it uses the Freedman--Diaconis rule, which is resilient to outliers and takes into account the data variability and data size~\cite{meshgi2015expanding}, to generate a histogram from the {\thething} set $L$.
This way, it achieves an improved complexity, compared to the \texttt{Heuristic}, that is dependent on the histogram's bin size.
Algorithm~\ref{algo:lmdk-sel-hist} demonstrates the overall process.
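For reference, in its textbook form the Freedman--Diaconis rule sets the bin width of a sample $x$ of $n$ observations to
\[
  h = 2 \, \frac{\mathrm{IQR}(x)}{\sqrt[3]{n}},
\]
where $\mathrm{IQR}(x)$ is the interquartile range of the sample.
We assume here that the sample is the {\thething} set $L$, i.e.,~$n = |L|$; the symbols $x$, $n$, and $h$ are notation introduced only for this sketch.
The interquartile range ignores the extreme quartiles, which explains the resilience to outliers, while the factor $n^{-1/3}$ yields narrower bins for larger samples.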
\begin{algorithm}
  \caption{\texttt{Partitioned} {\thething} set options generation}
  \label{algo:lmdk-sel-hist}
  \DontPrintSemicolon
  \KwData{the time series timestamps $T$, the {\thething} set $L$}
  \KwResult{the selected {\thething} set options \opts}
  % \kat{verify description of variables}
  % \mk{OK}
  \BlankLine
  \hist, \h $\leftarrow$ \getHist{$T, L$}\;
  \histCur $\leftarrow$ \hist\;
  \opts $\leftarrow$ $[]$\;
  % \kat{L' not defined..}
  % \mk{It was histCur}
  \While{\sumHist{\histCur} $\neq$ \len{$T$}}{
    \label{algo:lmdk-sel-hist-while}
    \diffMin $\leftarrow$ $\infty$\; % \tcp*{Track the best evaluation}
    \opt $\leftarrow$ \histCur\; % \tcp*{The candidate option}
    \ForEach{\hi \textnormal{\textbf{in}} \histCur}{ % \tcp*{Repeat for every bin}
      \label{algo:lmdk-sel-hist-cmp-start}
      \If{\hi $+$ $1$ $\leq$ \h}{ % \tcp*{Can we add one more point?}
        \histTmp $\leftarrow$ \histCur\;
        {\histTmp}[$i$] $\leftarrow$ {\histTmp}[$i$] $+$ $1$\;
        \diffCur $\leftarrow$ \getDiff{\hist, \histTmp}\; % \tcp*{Find difference from original}
        \label{algo:lmdk-sel-hist-getDiff}
        \If{\diffCur $<$ \diffMin}{ % \tcp*{Remember if it is the best that you've seen}
          \label{algo:lmdk-sel-hist-cmp}
          \diffMin $\leftarrow$ \diffCur\;
          \opt $\leftarrow$ \histTmp\;
        }
      }
    } \label{algo:lmdk-sel-hist-cmp-end}
    \histCur $\leftarrow$ \opt\; % \tcp*{Update current histogram}
    \opts.append(\opt)\; % \tcp*{Add current best to options}
  } \label{algo:lmdk-sel-hist-end}
  \Return{\opts}
\end{algorithm}
Function \getHist generates a histogram with bins of size \h for the given time series timestamps $T$ and {\thething} set $L$.
For every new histogram version, the \getDiff function (Line~\ref{algo:lmdk-sel-hist-getDiff}) finds the difference from the original histogram; for this operation it utilizes the Euclidean distance~(see Section~\ref{subsec:sel-utl} for more details).
In Lines~{\ref{algo:lmdk-sel-hist-cmp-start}--\ref{algo:lmdk-sel-hist-cmp-end}}, the algorithm generates candidate histogram versions by incrementing each bin by $1$ in turn, and compares each candidate to the original (Line~\ref{algo:lmdk-sel-hist-cmp}).
In the end, it returns \opts, which contains the versions of \hist that are closest to the original \hist, one for each iteration of the loop in Lines~{\ref{algo:lmdk-sel-hist-while}--\ref{algo:lmdk-sel-hist-end}}.
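As a sketch of the comparison step, assume that \getDiff computes the plain Euclidean distance between two histograms $H$ and $H'$ with $k$ bins (the symbols $H$, $H'$, $k$, and $d$ are notation introduced only for this sketch):
\[
  d(H, H') = \sqrt{\sum_{i = 1}^{k} \left(H_i - H'_i\right)^2}.
\]
For a hypothetical original histogram $H = (2, 0, 1)$ with \h $\geq 4$, every candidate of the first iteration, i.e.,~$(3, 0, 1)$, $(2, 1, 1)$, and $(2, 0, 2)$, lies at distance $1$ from $H$.
Because the comparison is strict, the algorithm keeps the first candidate, $(3, 0, 1)$.
The candidates of the second iteration are then $(4, 0, 1)$ at distance $2$, and $(3, 1, 1)$ and $(3, 0, 2)$ at distance $\sqrt{2}$.
Hence, under this metric, the algorithm tends to spread the dummy {\thethings} over different bins before adding a second one to the same bin.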
\subsubsection{Privacy-preserving option selection}
\label{subsec:lmdk-opt-sel}
The algorithms that we presented in Section~\ref{subsec:lmdk-set-opts} return a set of possible versions of the original {\thething} set $L$, obtained by adding to it extra timestamps from $T \setminus L$.
In the next step, we randomly select a set by utilizing the exponential mechanism (Section~\ref{subsec:prv-mech}).
For this procedure, we allocate a small fraction of the available privacy budget, i.e.,~$1$\% or even less (see Section~\ref{subsec:sel-eps} for more details), which adds up to that of the publishing scheme according to Theorem~\ref{theor:compo-seq-ind}.
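As a reminder, and deferring to Section~\ref{subsec:prv-mech} for the definition that we adopt, the exponential mechanism in its standard form selects an option $o$ from the set of options $O$ with probability
\[
  \Pr[o] = \frac{\exp\left(\frac{\varepsilon_s \, u(o)}{2 \Delta u}\right)}{\sum_{o' \in O} \exp\left(\frac{\varepsilon_s \, u(o')}{2 \Delta u}\right)},
\]
where $u$ is the utility score function, $\Delta u$ its sensitivity, and $\varepsilon_s$ the budget reserved for the selection; these symbols are notation introduced only for this sketch.
For instance, with a hypothetical total budget $\varepsilon = 1$ and a $1$\% allocation, $\varepsilon_s = 0.01$ and the publishing scheme receives the remaining $0.99$, so that the two budgets sum to $\varepsilon$ under sequential composition.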
\paragraph{Utility score function}
Prior to selecting a {\thething} timestamp set including the original along with dummy {\thethings}, the exponential mechanism evaluates each set using a utility score function.
We present here two ways of doing so.
One way to evaluate each set is by taking into account the temporal position of the events in the sequence.
Events that occur at recent timestamps are more likely to reveal sensitive information regarding the users involved~\cite{kellaris2014differentially}.
Hence, indicating the existence of dummy {\thethings} near actual {\thethings} can increase the adversarial confidence regarding the location of the latter within a series of events.
In other words, sets with dummy {\thethings} with smaller average temporal distance from actual {\thethings} achieve better utility scores.
Another approach for the utility score function is to consider the number of events in each set.
Sets with more dummy {\thethings} may render actual {\thethings} more indistinguishable, and therefore provide less utility.
Consequently, more dummy {\thethings} lead to distributing the privacy budget to more events, and therefore to more robust overall privacy protection.
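To illustrate the second approach, assume a hypothetical score $u(L') = -|L' \setminus L|$ with sensitivity $\Delta u = 1$, a selection budget $\varepsilon_s = 0.1$, and three options with $1$, $2$, and $3$ dummy {\thethings}; none of these values is prescribed by the scheme.
Under the standard exponential mechanism, where each option is selected with probability proportional to $\exp(\varepsilon_s u / (2 \Delta u))$, the weights are
\[
  e^{-0.05} \approx 0.951, \qquad e^{-0.10} \approx 0.905, \qquad e^{-0.15} \approx 0.861,
\]
which normalize to selection probabilities of approximately $0.350$, $0.333$, and $0.317$, respectively.
With such a small budget the selection is close to uniform, and the preference for options with fewer dummy {\thethings} is only slight.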
\paragraph{Option release}
In the last step, the privacy-preserving dummy {\thething} selection module releases a new {\thething} set (including the original {\thethings} along with the dummy ones) from the options that were generated in the previous step, by utilizing the exponential mechanism.
The options generated by the \texttt{Optimal} and \texttt{Heuristic} algorithms contain actual timestamps that can be utilized directly by the {\thething} privacy schemes that we presented in Section~\ref{subsec:lmdk-sol}.
However, the \texttt{Partitioned} algorithm returns histograms instead of timestamps.
Therefore, we need to process the result of the exponential mechanism further by sampling without replacement from the set $T \setminus L$ according to the selected histogram's probability density function.