From 423b74b6354d5d12156012aa1a3434cb406301df Mon Sep 17 00:00:00 2001
From: Manos <manos@delkappa.com>
Date: Tue, 12 Oct 2021 04:21:46 +0200
Subject: [PATCH] evaluation: lmdk-sel-sol

---
 text/problem/theotherthing/solution.tex | 102 ++++++++++++------------
 1 file changed, 52 insertions(+), 50 deletions(-)

diff --git a/text/problem/theotherthing/solution.tex b/text/problem/theotherthing/solution.tex
index 7a7bfa3..54b7e49 100644
--- a/text/problem/theotherthing/solution.tex
+++ b/text/problem/theotherthing/solution.tex
@@ -2,40 +2,37 @@
 \label{subsec:lmdk-sel-sol}
 
 The main idea of the privacy-preserving {\thething} selection component is to privately select extra {\thething} event timestamps, i.e.,~dummy {\thethings}, from the set of timestamps $T /\ L$ of the time series $S_T$ and add them to the original {\thething} set $L$.
+Selecting extra events, on top of the actual {\thethings}, as dummy {\thethings} can render actual ones indistinguishable.
+The goal is to select a list of sets with additional timestamps from a series of events at timestamps $T$ for a set of {\thethings} at $L \subseteq T$.
 Thus, we create a new set $L'$ such that $L \subset L' \subseteq T$.
-We generate a set of dummy {\thething} set options by adding regular event timestamps from $T /\ L$ to $L$ (Section~\ref{subsec:lmdk-set-opts}).
-Then (Section~\ref{subsec:lmdk-opt-sel}), we utilize the exponential mechanism, with a utility function that calculates an indicator for each of the options in the set based on how much it differs from the original {\thething} set $L$, and randomly select one ot the options that we created earlier.
+
+First, we generate a set of dummy {\thething} set options by adding regular event timestamps from $T \setminus L$ to $L$ (Section~\ref{subsec:lmdk-set-opts}).
+Then, we utilize the exponential mechanism, with a utility function that calculates an indicator for each of the options in the set based on how much it differs from the original {\thething} set $L$, and randomly select one of the options (Section~\ref{subsec:lmdk-opt-sel}).
 This process provides an extra layer of privacy protection to {\thethings}, and thus allows the release, and thereafter processing, of {\thething} timestamps.
 
 % We utilize the exponential mechanism with a utility function that calculates an indicator for each of the options in the set that we selected in the previous step.
 % The utility depends on the positioning of the {\thething} timestamps of an option in the series, e.g.,~the distance from the previous/next {\thething}, the distance from the start/end of the series, etc.
 
 
-\subsubsection{{\Thething} set options}
+\subsubsection{{\Thething} set options generation}
 \label{subsec:lmdk-set-opts}
 
-This step aims to select a set of candidate {\thething} timestamps options either by randomizing the actual timestamps (Section~\ref{subsec:lmdk-rnd}), or by inserting dummy timestamps (Section~\ref{subsec:lmdk-dum-gen}) to the actual {\thething} timestamps.
-
-
-\paragraph{Dummy {\thething} generation}
-\label{subsec:lmdk-dum-gen}
-
-Selecting extra events, on top of the actual {\thethings}, as dummy {\thethings} can render actual ones indistinguishable.
-The goal is to select a list of sets with additional timestamps from a series of events at timestamps $\{t_n\}$ for a set of {\thethings} at $\{l_k\} \subseteq \{t_n\}$.
 Algorithms~\ref{algo:lmdk-sel-opt} and \ref{algo:lmdk-sel-heur} approach this problem with an optimal and heuristic methodology, respectively.
+Function \evalSeq evaluates the result of the union of $L$ and a timestamp combination from $T \setminus L$ by, e.g.,~estimating the standard deviation of all the distances from the previous/next {\thething}.
+\getOpts returns all the possible \emph{valid} sets of combinations \opt such that larger options contain all of the timestamps that are present in smaller ones.
+Each option contains sets of timestamps with sizes $\left|L\right| + 1, \left|L\right| + 2, \dots, \left|T\right|$, where each set is the union of $L$ with $x \in [1, \left|T\right| - \left|L\right|]$ timestamps from $T \setminus L$.
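+
+As a small illustration, assume (hypothetically) that $T = \{t_1, t_2, t_3, t_4\}$ and $L = \{t_2\}$.
+Under our reading of the validity constraint, one valid option is the nested chain
+\[
+  \{t_2, t_3\} \subset \{t_1, t_2, t_3\} \subset \{t_1, t_2, t_3, t_4\},
+\]
+whose sets have sizes $\left|L\right| + 1 = 2$, $3$, and $\left|T\right| = 4$, i.e.,~each set extends $L$ with $x \in \{1, 2, 3\}$ timestamps from $T \setminus L$.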
 
-Function \calcMetric measures an indicator for the union of $\{l_k\}$ and a timestamp combination from $\{t_n\} \setminus \{l_k\}$.
-Function \evalSeq evaluates the result of \calcMetric by, e.g.,~estimating the standard deviation of all the distances from the previous/next {\thething}.
-Function \getOpts returns all possible \emph{valid} sets of combinations \opt such that $\{l_{k+i}\} \subset \{l_{k+j}\}, \forall i, j \in [k, n] \mid i < j$, i.e.,~larger options must contain all of the timestamps that are present in smaller ones.
-Each combination contains a set of timestamps with sizes $k + 1, k + 2, \dots, n$, where each one of them is a combination of $\{l_k\}$ with $x \in [1, n - k]$ timestamps from $\{t_n\}$.
+\paragraph{Optimal}
+Between Lines~{\ref{algo:lmdk-sel-opt-for-each}--\ref{algo:lmdk-sel-opt-end}}, Algorithm~\ref{algo:lmdk-sel-opt} evaluates each option in \opts.
+It finds the option that is the most \emph{similar} to the original (Lines~{\ref{algo:lmdk-sel-opt-comparison}--\ref{algo:lmdk-sel-opt-end}}), i.e.,~the option whose evaluation differs the least from that of the sequence $T$ with {\thethings} $L$.
 
 \begin{algorithm}
-  \caption{Optimal dummy {\thething} set options selection}
+  \caption{Optimal dummy {\thething} set options generation}
   \label{algo:lmdk-sel-opt}
 
   \DontPrintSemicolon
 
-  \KwData{$\{t_n\}, \{l_k\}$}
+  \KwData{$T, L$}
 
   \SetKwInput{KwData}{Input}
 
@@ -43,11 +40,10 @@ Each combination contains a set of timestamps with sizes $k + 1, k + 2, \dots, n
   \BlankLine
 
   % Evaluate the original
-  \metricOrig $\leftarrow$ \calcMetric{$\{t_n\}, \emptyset, \{l_k\}$}\;
-  \evalOrig $\leftarrow$ \evalSeq{\metricOrig}\;
+  \evalOrig $\leftarrow$ \evalSeq{$T, \emptyset, L$}\;
 
   % Get all possible option combinations
-  \opts $\leftarrow$ \getOpts{$\{t_n\}, \{l_k\}$}\;
+  \opts $\leftarrow$ \getOpts{$T, L$}\;
 
   % Track the minimum (best) evaluation
   \diffMin $\leftarrow$ $\infty$\;
@@ -55,25 +51,29 @@ Each combination contains a set of timestamps with sizes $k + 1, k + 2, \dots, n
   % Track the optimal sequence (the one with the best evaluation)
   \optim $\leftarrow$ $[]$\;
 
-  \ForEach{\opt $\in$ \opts}{\label{algo:lmdk-sel-opt-for-each}
-    \evalSum $\leftarrow 0$\;
+  \ForEach{\opt $\in$ \opts}{ \label{algo:lmdk-sel-opt-for-each}
+    \evalCur $\leftarrow 0$\;
     \ForEach{\opti $\in$ \opt}{
-      \metricCur $\leftarrow$ \calcMetric{$\{t_n\}, \opti, \{l_k\}$}\;\label{algo:lmdk-sel-opt-comparison}
-      \evalSum $\leftarrow$ \evalSum $+$ \evalSeq{\metricCur}\;
-
-      % Compare with current optimal
-      \diffCur $\leftarrow \left|\evalSum/\#\opt - \evalOrig\right|$\;
-      \If{\diffCur $<$ \diffMin}{
-        \diffMin $\leftarrow$ \diffCur\;
-        \optim $\leftarrow$ \opt\;
-      }
+      \evalCur $\leftarrow$ \evalCur $+$ \evalSeq{$T, \opti, L$}/\#\opt\; \label{algo:lmdk-sel-opt-comparison}
     }
-  }\label{algo:lmdk-sel-opt-end}
+    % Compare with current optimal
+    \diffCur $\leftarrow \left|\evalCur - \evalOrig\right|$\;
+    \If{\diffCur $<$ \diffMin}{
+      \diffMin $\leftarrow$ \diffCur\;
+      \optim $\leftarrow$ \opt\;
+    }
+  } \label{algo:lmdk-sel-opt-end}
   \Return{\optim}
 \end{algorithm}
 
-Algorithm~\ref{algo:lmdk-sel-opt}, in particular, between Lines~{\ref{algo:lmdk-sel-opt-for-each}-\ref{algo:lmdk-sel-opt-end}} evaluates each option in \opts.
-It finds the option that is the most \emph{similar} to the original (Lines~{\ref{algo:lmdk-sel-opt-comparison}-\ref{algo:lmdk-sel-opt-end}}), i.e.,~the option that has an evaluation that differs the least from that of the sequence $\{t_n\}$ with {\thethings} $\{l_k\}$.
+Algorithm~\ref{algo:lmdk-sel-opt} is guaranteed to return the optimal set of dummy {\thethings} with regard to the original set $L$.
+However, it is rather costly in terms of complexity: given $n$ regular events and a combination of size $r$, it requires $\mathcal{O}(C(n, r) + 2^{C(n, r)})$ time and $\mathcal{O}(r \cdot C(n, r))$ space.
+Next, we present a heuristic solution with improved time and space requirements.
+
+
+\paragraph{Heuristic}
+Algorithm~\ref{algo:lmdk-sel-heur} follows an incremental methodology.
+At each step, it selects a new timestamp that corresponds to a regular ({non-\thething}) event from $T \setminus L$.
 
 \begin{algorithm}
   \caption{Heuristic dummy {\thething} set options selection}
@@ -81,30 +81,28 @@ It finds the option that is the most \emph{similar} to the original (Lines~{\ref
 
   \DontPrintSemicolon
 
-  \KwData{$\{t_n\}, \{l_k\}$}
+  \KwData{$T, L$}
   \KwResult{\optim}
   \BlankLine
 
   % Evaluate the original
-  \metricOrig $\leftarrow$ \calcMetric{$\{t_n\}, \emptyset, \{l_k\}$}\;
-  \evalOrig $\leftarrow$ \evalSeq{\metricOrig}\;
+  \evalOrig $\leftarrow$ \evalSeq{$T, \emptyset, L$}\;
 
   % Get all possible option combinations
   \optim $\leftarrow$ $[]$\;
 
-  $\{l_{k'}\} \leftarrow \{l_k\}$\;
+  $L' \leftarrow L$\;
 
-  \While{$\{l_{k'}\} \neq \{t_n\}$}{\label{algo:lmdk-sel-heur-while}
+  \While{$L' \neq T$}{\label{algo:lmdk-sel-heur-while}
     % Track the minimum (best) evaluation
     \diffMin $\leftarrow$ $\infty$\;
 
-    \optimi $\leftarrow$ $0$\;
+    \optimi $\leftarrow$ Null\;
     % Find the combinations for one more point
-    \ForEach{\reg $\in \{t_n\} \setminus \{l_{k'}\}$}{
+    \ForEach{\reg $\in T \setminus L'$}{
 
       % Evaluate current
-      \metricCur $\leftarrow$ \calcMetric{$\{t_n\}, \reg, \{l_{k'}\}$}\;\label{algo:lmdk-sel-heur-comparison}
-      \evalCur $\leftarrow$ \evalSeq{\metricCur}\;
+      \evalCur $\leftarrow$ \evalSeq{$T, \reg, L'$}\; \label{algo:lmdk-sel-heur-comparison}
 
       % Compare evaluations
       \diffCur $\leftarrow$ $\left|\evalCur - \evalOrig\right|$\;
@@ -116,27 +114,31 @@ It finds the option that is the most \emph{similar} to the original (Lines~{\ref
     }
 
     % Save new point to landmarks
-    $k' \leftarrow k' + 1$\;
-    $l_{k'} \leftarrow \optimi$\;
+    $L'$.add(\optimi)\;
 
     % Add new option
-    \optim.add($\{l_{k'}\} \setminus \{l_k\}$)\;
+    \optim.append($L' \setminus L$)\;
   }\label{algo:lmdk-sel-heur-end}
 
   \Return{\optim}
 \end{algorithm}
 
-Algorithm~\ref{algo:lmdk-sel-heur}, follows an incremental methodology.
-At each step it selects a new timestamp that corresponds to a regular ({non-\thething}) event from $\{t_n\} \setminus \{l_k\}$.
 Similar to Algorithm~\ref{algo:lmdk-sel-opt}, the selection is done based on a predefined metric (Lines~{\ref{algo:lmdk-sel-heur-comparison}-\ref{algo:lmdk-sel-heur-comparison-end}}).
-This process (Lines~{\ref{algo:lmdk-sel-heur-while}-\ref{algo:lmdk-sel-heur-end}}) goes on until we select a set that is equal to the size of the series of events, i.e.,~$\{l_{k'}\} = \{t_n\}$.
+This process (Lines~{\ref{algo:lmdk-sel-heur-while}--\ref{algo:lmdk-sel-heur-end}}) goes on until the selected set contains every timestamp of the series of events, i.e.,~$L' = T$.
 
-Note that the reverse heuristic approach, i.e.,~starting with $\{t_n\}$ {\thethings} and removing until $\{l_k\}$, performs worse than and occasionally the same with Algorithm~\ref{algo:lmdk-sel-heur}.
+In terms of complexity, given $n$ regular events, Algorithm~\ref{algo:lmdk-sel-heur} requires $\mathcal{O}(n^2)$ time and space.
+Note that the reverse heuristic approach, i.e.,~starting with $T$ {\thethings} and removing timestamps until only $L$ remains, performs similarly to Algorithm~\ref{algo:lmdk-sel-heur}.
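+
+The quadratic bound can be sketched by counting the evaluations on Line~\ref{algo:lmdk-sel-heur-comparison}, assuming that each evaluation has unit cost and writing $n = \left|T \setminus L\right|$ for the number of regular events.
+The $i$-th iteration of the outer loop considers the $n - i + 1$ candidates that remain in $T \setminus L'$, hence
+\[
+  \sum_{i = 1}^{n} (n - i + 1) = \frac{n (n + 1)}{2} \in \mathcal{O}(n^2).
+\]
+Likewise, the $i$-th option appended to the result holds $i$ timestamps, for a total of $\sum_{i = 1}^{n} i = n (n + 1) / 2$ stored timestamps.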
+
+
+
+\mk{WIP: Histograms}
 
 
 \subsubsection{Privacy-preserving option selection}
 \label{subsec:lmdk-opt-sel}
 
+\mk{WIP}
+
 % Nearby events
 Events that occur at recent timestamps are more likely to reveal sensitive information regarding the users involved~\cite{kellaris2014differentially}.
 Thus, taking into account more recent events with respect to {\thethings} can result in less privacy loss and better privacy protection overall.