\subsection{Protecting {\thethings}}
\label{subsec:lmdk-sel-sol}
The main idea of the privacy-preserving dummy {\thething} selection module is to privately select extra {\thething} event timestamps, i.e.,~dummy {\thethings}, from the set of timestamps $T \setminus L$ of the time series $S_T$ and add them to the original {\thething} set $L$.
Selecting extra events, on top of the actual {\thethings}, as dummy {\thethings}, can render the actual ones indistinguishable.
The goal is to create a new set $L'$ such that $L \subset L' \subseteq T$.
First, we generate a set of dummy {\thething} set options by adding regular event timestamps from $T \setminus L$ to $L$ (Section~\ref{subsec:lmdk-set-opts}).
Then, we utilize the exponential mechanism, with a utility function that calculates an indicator for each of the options in the set, based on how much it differs from the original {\thething} set $L$, and randomly select one of the options (Section~\ref{subsec:lmdk-opt-sel}).
This process provides an extra layer of privacy protection to {\thethings}, and thus allows the processing, and thereafter releasing, of {\thething} timestamps.
\subsubsection{Dummy {\thething} selection}
\label{subsec:lmdk-set-opts}
Algorithms~\ref{algo:lmdk-sel-opt} and \ref{algo:lmdk-sel-heur} approach this problem with an optimal and a heuristic methodology, respectively.
Function \evalSeq evaluates the result of the union of $L$ and a timestamp combination from $T \setminus L$ by, e.g.,~estimating the standard deviation of all the distances from the previous/next {\thething}.
\getOpts returns all the possible \emph{valid} sets of combinations \opt such that larger options contain all of the timestamps that are present in smaller ones.
Each option contains sets of timestamps with sizes $\left|L\right| + 1, \left|L\right| + 2, \dots, \left|T\right|$, where each set is the union of $L$ with $x \in [1, \left|T\right| - \left|L\right|]$ timestamps from $T \setminus L$.
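As a small illustration (a hypothetical instance, not taken from the evaluation), consider $T = \{t_1, \dots, t_5\}$ and $L = \{t_2, t_4\}$, so that $T \setminus L = \{t_1, t_3, t_5\}$.
The number of candidate sets of size $|L| + x$ is $C(|T \setminus L|, x)$:
\[
C(3, 1) = 3, \qquad C(3, 2) = 3, \qquad C(3, 3) = 1 .
\]
Under the nesting condition stated above, one valid option is the chain
\[
\{t_1, t_2, t_4\} \subset \{t_1, t_2, t_3, t_4\} \subset T ,
\]
whereas a sequence that continues $\{t_1, t_2, t_4\}$ with $\{t_2, t_3, t_4, t_5\}$ is not valid, because the larger set drops $t_1$.
Assuming that every such chain counts as a separate option, each ordering of the three regular timestamps yields one chain, i.e.,~$3! = 6$ options in this instance.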
\paragraph{\texttt{Optimal}}
The \texttt{Optimal} algorithm (Algorithm~\ref{algo:lmdk-sel-opt}) generates every possible option, i.e.,~every combination of {\thething} sets $L'$ that contains one set of each possible size $|L| + 1, |L| + 2, \dots, |T|$.
Each $L'$ contains the original {\thethings} along with timestamps of regular events from $T \setminus L$ (dummy {\thethings}).
Then, it evaluates each option by comparing each of its sets with the original {\thething} set $L$ and estimating an overall similarity score for each option (Lines~{\ref{algo:lmdk-sel-opt-for-each}--\ref{algo:lmdk-sel-opt-end}}).
We discuss possible utility score functions later on in Section~\ref{subsec:lmdk-opt-sel}.
It finds the option that is the most \emph{similar} to the original (Lines~{\ref{algo:lmdk-sel-opt-comparison}--\ref{algo:lmdk-sel-opt-end}}), i.e.,~the option whose evaluation differs the least from that of the sequence $T$ with {\thethings} $L$.
The goal of this process is to select the option whose combination of dummy {\thething} sets achieves the best score.
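In compact form, and writing $e(\cdot)$ for the value returned by \evalSeq (a shorthand introduced here only for illustration), the score of an option $o = \{o_1, \dots, o_k\}$ and the selection rule of Algorithm~\ref{algo:lmdk-sel-opt} can be read as
\[
\bar{e}(o) = \frac{1}{k} \sum_{i = 1}^{k} e(T, o_i, L),
\qquad
o^{*} = \arg\min_{o} \left| \bar{e}(o) - e(T, \emptyset, L) \right| ,
\]
where the minimum ranges over the options returned by \getOpts{$T, L$}.
Since the comparison in the algorithm is strict, ties are resolved in favor of the option that is encountered first.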
\begin{algorithm}
\caption{\texttt{Optimal} dummy {\thething} set options generation}
\label{algo:lmdk-sel-opt}
\DontPrintSemicolon
\KwData{the time series timestamps $T$, the {\thething} set $L$}
\KwResult{the selected {\thething} set options \opts}
\BlankLine
% Evaluate the original
\evalOrig $\leftarrow$ \evalSeq{$T, \emptyset, L$}\;
% Track the minimum (best) evaluation
\diffMin $\leftarrow$ $\infty$\;
% Track the optimal sequence (the one with the best evaluation)
\opts $\leftarrow$ $[]$\;
\ForEach{\opt $\in$ \getOpts{$T, L$}}{ \label{algo:lmdk-sel-opt-for-each}
\evalCur $\leftarrow 0$\;
\ForEach{\opti $\in$ \opt}{
\evalCur $\leftarrow$ \evalCur $+$ \evalSeq{$T, \opti, L$}/\#\opt\; \label{algo:lmdk-sel-opt-comparison}
}
% Compare with current optimal
\diffCur $\leftarrow \left|\evalCur - \evalOrig\right|$\;
\If{\diffCur $<$ \diffMin}{
\diffMin $\leftarrow$ \diffCur\;
\opts $\leftarrow$ \opt\;
}
} \label{algo:lmdk-sel-opt-end}
\Return{\opts}
\end{algorithm}
Algorithm~\ref{algo:lmdk-sel-opt} is guaranteed to return the optimal option with regard to the original set $L$.
However, it is rather costly in terms of complexity.
In more detail, given $|T \setminus L|$ regular events and a combination of size $r$, it requires $O(C(|T \setminus L|, r) + 2^{C(|T \setminus L|, r)})$ time and $O(r \cdot C(|T \setminus L|, r))$ space.
Next, we present a \texttt{Heuristic} solution with improved time and space requirements.
\paragraph{\texttt{Heuristic}}
The \texttt{Heuristic} algorithm (Algorithm~\ref{algo:lmdk-sel-heur}) follows an incremental methodology: at each step, it selects a new timestamp that corresponds to a regular event from $T \setminus L'$.
Consequently, the set $L'$ at each step differs by exactly one element from the set that the algorithm selected in the previous step.
Similar to the \texttt{Optimal}, it selects each new set based on a predefined similarity metric, and it stops when the selected set contains every timestamp of the series of events, i.e.,~$L' = T$.
\begin{algorithm}
\caption{\texttt{Heuristic} dummy {\thething} set options generation}
\label{algo:lmdk-sel-heur}
\DontPrintSemicolon
\KwData{the time series timestamps $T$, the {\thething} set $L$}
\KwResult{the selected {\thething} set options \opts}
\BlankLine
% Evaluate the original
\evalOrig $\leftarrow$ \evalSeq{$T, \emptyset, L$}\;
% Initialize the list of options and the working set L'
\opts $\leftarrow$ $[]$\;
$L' \leftarrow L$\;
\While{$L' \neq T$}{\label{algo:lmdk-sel-heur-while}
% Track the minimum (best) evaluation
\diffMin $\leftarrow$ $\infty$\;
\optimi $\leftarrow$ Null\;
% Find the combinations for one more point
\ForEach{\reg $\in T \setminus L'$}{
% Evaluate current
\evalCur $\leftarrow$ \evalSeq{$T, \reg, L'$}\; \label{algo:lmdk-sel-heur-comparison}
% Compare evaluations
\diffCur $\leftarrow$ $\left|\evalCur - \evalOrig\right|$\;
\If{\diffCur $<$ \diffMin}{
\diffMin $\leftarrow$ \diffCur\;
\optimi $\leftarrow$ \reg\;
}\label{algo:lmdk-sel-heur-cmp-end}
}
% Save new point to landmarks
$L'$.add(\optimi)\;
% Add new option
\opts.append($L' \setminus L$)\;
}\label{algo:lmdk-sel-heur-end}
\Return{\opts}
\end{algorithm}
Similar to Algorithm~\ref{algo:lmdk-sel-opt}, it selects new options based on a predefined metric (Lines~{\ref{algo:lmdk-sel-heur-comparison}--\ref{algo:lmdk-sel-heur-cmp-end}}).
This process (Lines~{\ref{algo:lmdk-sel-heur-while}--\ref{algo:lmdk-sel-heur-end}}) continues until the selected set contains every timestamp of the series of events, i.e.,~$L' = T$.
In terms of complexity, given $|T \setminus L|$ regular events, the \texttt{Heuristic} requires $O(|T \setminus L|^2)$ time and space.
Note that the reverse process, i.e.,~starting with $L' = T$ and removing timestamps until $|L'| = |L| + 1$, performs similarly.
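The quadratic bound of the \texttt{Heuristic} can be checked by counting the calls to \evalSeq, under the assumption that a single call takes constant time.
With $n = |T \setminus L|$, the outer loop runs $n$ times, and its $j$-th iteration examines the $n - j + 1$ timestamps that are still in $T \setminus L'$:
\[
\sum_{j = 1}^{n} (n - j + 1) = \frac{n (n + 1)}{2} = O(n^2) .
\]
The same sum bounds the space, since the $j$-th stored option $L' \setminus L$ holds $j$ timestamps.
For instance, $n = 4$ gives $4 + 3 + 2 + 1 = 10$ evaluations.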
\paragraph{\texttt{Partitioned}}
We improve the complexity of the \texttt{Heuristic} algorithm by partitioning the {\thething} timestamp sequence $L$.
The novelty of this algorithm lies in the fact that it treats the event series as a histogram, which allows it to take advantage of the features and methodology of histograms.
Particularly, it generates a histogram from the {\thething} set $L$ by using the Freedman--Diaconis rule, which is resilient to outliers and takes into account the data variability and data size~\cite{meshgi2015expanding}.
This way, it achieves an improved complexity, compared to the \texttt{Heuristic}, that is dependent on the histogram's bin size.
Algorithm~\ref{algo:lmdk-sel-hist} demonstrates the overall process.
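For reference, the Freedman--Diaconis rule sets the bin width of a histogram over $n$ observations to
\[
h = 2 \, \frac{\mathrm{IQR}}{\sqrt[3]{n}} ,
\]
where $\mathrm{IQR}$ is the interquartile range of the observations.
Assuming that the rule is applied to the {\thething} timestamps, $n = |L|$ and $\mathrm{IQR}$ is computed over $L$; as a hypothetical instance, $|L| = 27$ and $\mathrm{IQR} = 6$ give a bin width of $h = 2 \cdot 6 / 3 = 4$.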
\begin{algorithm}
\caption{\texttt{Partitioned} dummy {\thething} set options generation}
\label{algo:lmdk-sel-hist}
\DontPrintSemicolon
\KwData{the time series timestamps $T$, the {\thething} set $L$}
\KwResult{the selected {\thething} set options \opts}
\BlankLine
\hist, \h $\leftarrow$ \getHist{$T, L$}\;
\histCur $\leftarrow$ \hist\;
\opts $\leftarrow$ $[]$\;
\While{\sumHist{\histCur} $\neq$ \len{$T$}}{
\label{algo:lmdk-sel-hist-while}
\diffMin $\leftarrow$ $\infty$\; % \tcp*{Track the best evaluation}
\opt $\leftarrow$ \histCur\; % \tcp*{The candidate option}
\ForEach{\hi \textnormal{\textbf{in}} \histCur}{ % \tcp*{Repeat for every bin}
\label{algo:lmdk-sel-hist-cmp-start}
\If{\hi $+$ $1$ $\leq$ \h}{ % \tcp*{Can we add one more point?}
\histTmp $\leftarrow$ \histCur\;
{\histTmp}[$i$] $\leftarrow$ {\histTmp}[$i$] $+$ $1$\;
\diffCur $\leftarrow$ \getDiff{\hist, \histTmp}\; % \tcp*{Find difference from original}
\label{algo:lmdk-sel-hist-getDiff}
\If{\diffCur $<$ \diffMin}{ % \tcp*{Remember if it is the best that you've seen}
\label{algo:lmdk-sel-hist-cmp}
\diffMin $\leftarrow$ \diffCur\;
\opt $\leftarrow$ \histTmp\;
}
}
} \label{algo:lmdk-sel-hist-cmp-end}
\histCur $\leftarrow$ \opt\; % \tcp*{Update current histogram}
\opts.append(\opt)\; % \tcp*{Add current best to options}
} \label{algo:lmdk-sel-hist-end}
\Return{\opts}
\end{algorithm}
Function \getHist generates a histogram with bins of size \h for the given time series timestamps $T$ and {\thething} set $L$.
For every new histogram version, the \getDiff function (Line~\ref{algo:lmdk-sel-hist-getDiff}) finds the difference from the original histogram; for this operation it utilizes the Euclidean distance~(see Section~\ref{subsec:sel-utl} for more details).
In Lines~{\ref{algo:lmdk-sel-hist-cmp-start}--\ref{algo:lmdk-sel-hist-cmp-end}}, the algorithm generates candidate histogram versions by incrementing each bin by $1$ in turn and compares each candidate to the original (Line~\ref{algo:lmdk-sel-hist-cmp}).
In the end, it returns \opts, which contains, for every possible number of added timestamps, the version of \hist that the algorithm found to be closest to the original.
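As a worked trace under hypothetical values, let $|T| = 6$, let the bin size be $2$, and let the original histogram be $H = (2, 0, 1)$.
Taking the Euclidean distance
\[
d(H, H') = \sqrt{\sum_{i} (H_i - H'_i)^2}
\]
for \getDiff, the loop proceeds as follows (a full bin is skipped by the capacity check):
\[
\begin{array}{lll}
\mbox{current} & \mbox{candidates: distance from } H & \mbox{selected} \\
\hline
(2, 0, 1) & (2, 1, 1){:}\ 1, \quad (2, 0, 2){:}\ 1 & (2, 1, 1) \\
(2, 1, 1) & (2, 2, 1){:}\ 2, \quad (2, 1, 2){:}\ \sqrt{2} & (2, 1, 2) \\
(2, 1, 2) & (2, 2, 2){:}\ \sqrt{5} & (2, 2, 2)
\end{array}
\]
The tie in the first step is resolved in favor of the first candidate, because the comparison is strict.
The loop stops once the bins sum to $|T| = 6$, and the three selected histograms are the returned options.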
\subsubsection{Privacy-preserving option selection}
\label{subsec:lmdk-opt-sel}
The algorithms that we presented in Section~\ref{subsec:lmdk-set-opts} return a set of possible versions of the original {\thething} set $L$, each obtained by adding to it extra timestamps from $T \setminus L$.
In the next step, we randomly select a set by utilizing the exponential mechanism (Section~\ref{subsec:prv-mech}).
For this procedure, we allocate a small fraction of the available privacy budget, i.e.,~$1$\% or even less (see Section~\ref{subsec:sel-eps} for more details), which adds up to that of the publishing scheme according to Theorem~\ref{theor:compo-seq-ind}.
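As a reminder, and in generic notation that may differ from that of Section~\ref{subsec:prv-mech}, the exponential mechanism with privacy budget $\varepsilon$, utility function $u$, and sensitivity $\Delta u$ selects an option $o$ from the generated options $O$ with probability
\[
\Pr[o] = \frac{\exp\left(\frac{\varepsilon \, u(L, o)}{2 \Delta u}\right)}{\sum_{o' \in O} \exp\left(\frac{\varepsilon \, u(L, o')}{2 \Delta u}\right)} .
\]
For example, with a hypothetical total budget of $\varepsilon = 1$, allocating $1$\% to the selection leaves $\varepsilon_{\mathrm{sel}} = 0.01$ for the mechanism and $0.99$ for the publishing scheme, so that the two add up to $\varepsilon = 1$ under sequential composition.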
\paragraph{Utility score function}
Prior to selecting a {\thething} timestamp set including the original along with dummy {\thethings}, the exponential mechanism evaluates each set using a utility score function.
We present here two ways of doing so.
One way to evaluate each set is by taking into account the temporal position of the events in the sequence.
Events that occur at recent timestamps are more likely to reveal sensitive information regarding the users involved~\cite{kellaris2014differentially}.
Hence, indicating the existence of dummy {\thethings} near actual {\thethings} can increase the adversarial confidence regarding the location of the latter within a series of events.
In other words, sets with dummy {\thethings} at a smaller average temporal distance from actual {\thethings} achieve better utility scores.
Another approach for the utility score function is to consider the number of events in each set.
Sets with more dummy {\thethings} may render actual {\thethings} less distinguishable, and therefore provide less utility.
Consequently, more dummy {\thethings} lead to distributing the privacy budget to more events, and therefore to more robust overall privacy protection.
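Both alternatives can be written down explicitly; the following forms are illustrative assumptions rather than the definitions used in the evaluation.
For an option $L'$ with dummy {\thethings} $D = L' \setminus L$, a distance-based score and a count-based score could be
\[
u_{\mathrm{dist}}(L, L') = - \frac{1}{|D|} \sum_{d \in D} \min_{l \in L} |d - l| ,
\qquad
u_{\mathrm{cnt}}(L, L') = - |D| .
\]
With $L = \{2, 8\}$ and $L' = \{2, 3, 8, 10\}$, the dummy timestamps are $D = \{3, 10\}$, so $u_{\mathrm{dist}} = -(1 + 2)/2 = -1.5$ and $u_{\mathrm{cnt}} = -2$.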
\paragraph{Option release}
In the last step, the privacy-preserving dummy {\thething} selection module releases a new {\thething} set (including the original {\thethings} along with the dummy ones) from the options that were generated in the previous step, by utilizing the exponential mechanism.
The options generated by the \texttt{Optimal} and \texttt{Heuristic} algorithms contain actual timestamps that can be utilized directly by the {\thething} privacy schemes that we presented in Section~\ref{subsec:lmdk-sol}.
However, the \texttt{Partitioned} algorithm returns histograms instead of timestamps.
Therefore, we need to process the result of the exponential mechanism further by sampling without replacement from the set $T \setminus L$ according to the selected histogram's probability density function.
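To illustrate with hypothetical values, suppose that the original histogram is $(2, 0, 1)$, i.e.,~$|L| = 3$, and that the exponential mechanism selects $H' = (2, 1, 2)$.
The number of dummy {\thethings} to draw is then $5 - 3 = 2$.
Under one possible reading, in which the probability density function is the vector of normalized bin counts, bin $i$ is chosen with probability
\[
p_i = \frac{H'_i}{\sum_{j} H'_j},
\qquad
(p_1, p_2, p_3) = \left(\frac{2}{5}, \frac{1}{5}, \frac{2}{5}\right) .
\]
A timestamp of $T \setminus L$ that falls in the chosen bin is then drawn and removed from the candidates before the next draw.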