\subsection{Protecting {\thethings}}
\label{subsec:lmdk-sel-sol}
The main idea of the privacy-preserving {\thething} selection component is to privately select extra {\thething} event timestamps, i.e.,~dummy {\thethings}, from the set of timestamps $T \setminus L$ of the time series $S_T$ and add them to the original {\thething} set $L$.
Selecting extra events, on top of the actual {\thethings}, as dummy {\thethings} can render actual ones indistinguishable.
Given a series of events at timestamps $T$ and a set of {\thethings} at timestamps $L \subseteq T$, the goal is to generate a list of candidate sets, each of which extends $L$ with additional timestamps from $T$.
Thus, we create a new set $L'$ such that $L \subset L' \subseteq T$.
First, we generate a set of dummy {\thething} set options by adding regular event timestamps from $T \setminus L$ to $L$ (Section~\ref{subsec:lmdk-set-opts}).
Then, we utilize the exponential mechanism, with a utility function that calculates an indicator for each of the options in the set based on how much it differs from the original {\thething} set $L$, and randomly select one of the options (Section~\ref{subsec:lmdk-opt-sel}).
This process provides an extra layer of privacy protection to {\thethings}, and thus allows the release, and thereafter processing, of {\thething} timestamps.
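For reference, the exponential mechanism in its standard form selects an option $o$ from a set of options $O$ with probability
\[
  \Pr[o] = \frac{\exp\left(\frac{\varepsilon \, u(o)}{2 \Delta u}\right)}{\sum_{o' \in O} \exp\left(\frac{\varepsilon \, u(o')}{2 \Delta u}\right)},
\]
where $\varepsilon$ (the privacy budget allotted to the selection), $u$ (the utility function), $\Delta u$ (the sensitivity of $u$), and $O$ are notation introduced here only for illustration; the concrete utility function is not fixed at this point.
Options with a higher utility score are exponentially more likely to be selected, while every option retains a non-zero selection probability.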
% We utilize the exponential mechanism with a utility function that calculates an indicator for each of the options in the set that we selected in the previous step.
% The utility depends on the positioning of the {\thething} timestamps of an option in the series, e.g.,~the distance from the previous/next {\thething}, the distance from the start/end of the series, etc.
\subsubsection{{\Thething} set options generation}
\label{subsec:lmdk-set-opts}
Algorithms~\ref{algo:lmdk-sel-opt} and \ref{algo:lmdk-sel-heur} approach this problem with an optimal and a heuristic methodology, respectively.
Function \evalSeq evaluates the result of the union of $L$ and a timestamp combination from $T \setminus L$ by, e.g.,~estimating the standard deviation of all the distances from the previous/next {\thething}.
\getOpts returns all the possible \emph{valid} sets of combinations \opt such that larger options contain all of the timestamps that are present in smaller ones.
Each combination contains sets of timestamps of sizes $\left|L\right| + 1, \left|L\right| + 2, \dots, \left|T\right|$, each of which is the union of $L$ with $x \in [1, \left|T\right| - \left|L\right|]$ timestamps from $T \setminus L$.
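As an illustration, under our reading of the nesting requirement, consider the hypothetical series $T = \{t_1, \dots, t_5\}$ with $L = \{t_2, t_4\}$, so that $T \setminus L = \{t_1, t_3, t_5\}$.
One valid combination is the chain
\[
  L \cup \{t_3\} \;\subset\; L \cup \{t_3, t_5\} \;\subset\; L \cup \{t_1, t_3, t_5\} = T,
\]
whose sets have sizes $3$, $4$, and $5 = \left|T\right|$.
In contrast, the sequence $L \cup \{t_3\}$, $L \cup \{t_1, t_5\}$, $T$ is not valid, since its second set does not contain $t_3$.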
\paragraph{Optimal}
Between Lines~{\ref{algo:lmdk-sel-opt-for-each}--\ref{algo:lmdk-sel-opt-end}}, Algorithm~\ref{algo:lmdk-sel-opt} evaluates each option in \opts.
It finds the option that is the most \emph{similar} to the original (Lines~{\ref{algo:lmdk-sel-opt-comparison}--\ref{algo:lmdk-sel-opt-end}}), i.e.,~the option whose evaluation differs the least from that of the sequence $T$ with {\thethings} $L$.
\begin{algorithm}
\caption{Optimal dummy {\thething} set options generation}
\label{algo:lmdk-sel-opt}
\DontPrintSemicolon
\SetKwInput{KwData}{Input}
\KwData{$T, L$}
\KwResult{\optim}
\BlankLine
% Evaluate the original
\evalOrig $\leftarrow$ \evalSeq{$T, \emptyset, L$}\;
% Get all possible option combinations
\opts $\leftarrow$ \getOpts{$T, L$}\;
% Track the minimum (best) evaluation
\diffMin $\leftarrow$ $\infty$\;
% Track the optimal sequence (the one with the best evaluation)
\optim $\leftarrow$ $[]$\;
\ForEach{\opt $\in$ \opts}{ \label{algo:lmdk-sel-opt-for-each}
\evalCur $\leftarrow 0$\;
\ForEach{\opti $\in$ \opt}{
\evalCur $\leftarrow$ \evalCur $+$ \evalSeq{$T, \opti, L$}/\#\opt\; \label{algo:lmdk-sel-opt-comparison}
}
% Compare with current optimal
\diffCur $\leftarrow \left|\evalCur - \evalOrig\right|$\;
\If{\diffCur $<$ \diffMin}{
\diffMin $\leftarrow$ \diffCur\;
\optim $\leftarrow$ \opt\;
}
} \label{algo:lmdk-sel-opt-end}
\Return{\optim}
\end{algorithm}
Algorithm~\ref{algo:lmdk-sel-opt} is guaranteed to return the optimal set of dummy {\thethings} with regard to the original set $L$.
However, it is rather costly in terms of complexity: given $n$ regular events and a combination of size $r$, it requires $\mathcal{O}(C(n, r) + 2^{C(n, r)})$ time and $\mathcal{O}(r \cdot C(n, r))$ space.
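To give a sense of scale, take the hypothetical values $n = 10$ and $r = 3$, and read the exponential term of the time bound as $2^{C(n, r)}$:
\[
  C(10, 3) = \frac{10!}{3! \, 7!} = 120,
  \qquad
  2^{120} \approx 1.3 \times 10^{36},
  \qquad
  r \cdot C(n, r) = 3 \cdot 120 = 360.
\]
Hence, even for a series with only ten regular events, the exponential term makes an exhaustive evaluation impractical.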
Next, we present a heuristic solution with improved time and space requirements.
\paragraph{Heuristic}
Algorithm~\ref{algo:lmdk-sel-heur} follows an incremental methodology.
At each step it selects a new timestamp that corresponds to a regular ({non-\thething}) event from $T \setminus L$.
\begin{algorithm}
\caption{Heuristic dummy {\thething} set options selection}
\label{algo:lmdk-sel-heur}
\DontPrintSemicolon
\KwData{$T, L$}
\KwResult{\optim}
\BlankLine
% Evaluate the original
\evalOrig $\leftarrow$ \evalSeq{$T, \emptyset, L$}\;
% Initialize the list of options
\optim $\leftarrow$ $[]$\;
$L' \leftarrow L$\;
\While{$L' \neq T$}{\label{algo:lmdk-sel-heur-while}
% Track the minimum (best) evaluation
\diffMin $\leftarrow$ $\infty$\;
\optimi $\leftarrow$ Null\;
% Find the combinations for one more point
\ForEach{\reg $\in T \setminus L'$}{
% Evaluate current
\evalCur $\leftarrow$ \evalSeq{$T, \reg, L'$}\; \label{algo:lmdk-sel-heur-comparison}
% Compare evaluations
\diffCur $\leftarrow$ $\left|\evalCur - \evalOrig\right|$\;
\If{\diffCur $<$ \diffMin}{
\diffMin $\leftarrow$ \diffCur\;
\optimi $\leftarrow$ \reg\;
}\label{algo:lmdk-sel-heur-comparison-end}
}
% Save new point to landmarks
$L'$.add(\optimi)\;
% Add new option
\optim.append($L' \setminus L$)\;
}\label{algo:lmdk-sel-heur-end}
\Return{\optim}
\end{algorithm}
Similar to Algorithm~\ref{algo:lmdk-sel-opt}, the selection is done based on a predefined metric (Lines~{\ref{algo:lmdk-sel-heur-comparison}--\ref{algo:lmdk-sel-heur-comparison-end}}).
This process (Lines~{\ref{algo:lmdk-sel-heur-while}--\ref{algo:lmdk-sel-heur-end}}) goes on until the selected set contains every timestamp of the series of events, i.e.,~$L' = T$.

In terms of complexity, given $n$ regular events, it requires $\mathcal{O}(n^2)$ time and space.
Note that the reverse heuristic approach, i.e.,~starting with all of the timestamps in $T$ as {\thethings} and removing timestamps until only $L$ remains, performs similarly to Algorithm~\ref{algo:lmdk-sel-heur}.
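The quadratic bound for Algorithm~\ref{algo:lmdk-sel-heur} can be checked by counting the calls to the evaluation function, assuming that $n = \left|T \setminus L\right|$ and that each call costs one unit of work.
The outer loop runs $n$ times, and at its $k$-th iteration the inner loop examines the $n - k + 1$ regular timestamps that are still in $T \setminus L'$:
\[
  \sum_{k = 1}^{n} (n - k + 1) = \frac{n (n + 1)}{2} = \mathcal{O}(n^2).
\]
The same sum bounds the total number of timestamps stored in the list of options, since the $k$-th appended option $L' \setminus L$ contains $k$ timestamps.
For instance, $n = 10$ yields $55$ evaluations.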
\mk{WIP: Histograms}
\subsubsection{Privacy-preserving option selection}
\label{subsec:lmdk-opt-sel}
\mk{WIP}
% Nearby events
Events that occur at recent timestamps are more likely to reveal sensitive information regarding the users involved~\cite{kellaris2014differentially}.
Thus, taking into account more recent events with respect to {\thethings} can result in less privacy loss and better privacy protection overall.
However, this comes at the cost of worse data utility.
% Depending on the {\thething} discovery technique
The values of events near a {\thething} are usually similar to the value of the latter.
Therefore, privacy-preserving mechanisms are likely to approximate their values based on the nearest {\thething} instead of investing extra privacy budget to perturb their actual values, thus spending less privacy budget.
Saving privacy budget for releasing perturbed versions of actual event values can bring about better data utility.
% Distant events
However, indicating the existence of randomized/dummy {\thethings} near actual {\thethings} can increase the adversarial confidence regarding the location of the latter within a series of events.
Hence, choosing randomized/dummy {\thethings} far from the actual {\thethings} (and thus less relevant) can limit the final privacy loss.