the-last-thing/discussion.tex
2019-03-05 19:01:18 +01:00

\subsection{Discussion}
\label{subsec:discussion}
In the previous sections we provided a brief summary and review of each work that falls into the categories of Microdata and Statistical Data privacy-preserving publication under continual data schemes.
The main elements, summarized in Table~\ref{tab:related}, allow us to make some interesting observations on each category individually, as well as across categories.
In the Statistical Data section, all of the works deal with data linkage attacks, while there are some more recent works taking into consideration possible data correlations as well.
We notice that the literature currently treats data linkage as the worst-case scenario.
For this reason, works in the Statistical Data category provide a robust privacy protection solution that is independent of the adversaries' knowledge.
The prevailing distortion method in this category is probabilistic perturbation.
This is justified by the fact that nearly all of the observed methods are based on differential privacy.
The majority implements the Laplace mechanism, while some of them offer an adaptive approach.
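For reference, the Laplace mechanism mentioned above can be stated as follows; this is a minimal sketch in our own notation (the symbols $\mathcal{M}$, $f$, $D$, $\Delta f$, and $\varepsilon$ are assumptions and not necessarily those of the surveyed works).
For a numerical query $f$ over a data set $D$, the mechanism releases
\[
  \mathcal{M}(D) = f(D) + \mathrm{Lap}\left(\frac{\Delta f}{\varepsilon}\right),
  \qquad
  \Delta f = \max_{D \sim D'} \| f(D) - f(D') \|_1 ,
\]
where $D \sim D'$ ranges over neighboring data sets, $\varepsilon$ is the privacy budget, and the noise is drawn independently for each coordinate of $f$.
For a simple count query, $\Delta f = 1$, so the noise scale is $1 / \varepsilon$.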
In the Microdata category we observe that problems with sequential data, i.e.,~data that are generated in a sequence and depend on the values in previous data sets, are more prominent.
It is important to note that works on this set of problems follow similar scenarios, i.e.,~publishing updated versions of an original data set, either vertically (schema-wise) or horizontally (tuple-wise).
Naturally, in such cases the most evident attack scenarios are the complementary release ones, since there is a high probability that the tuples of each release intersect with those of previous releases.
On the other hand, when the problem involves streaming data or stream processing, we observe that the data are location-specific, most commonly trajectories.
In such cases, the attacks considered are broader than those targeting only versions of an original data set, and take into account external information, e.g.,~correlations that may typically be available for location-specific data.
Speaking of correlations, in either category, we see that the protection method used is mainly probabilistic, if not total suppression.
This makes sense, since generalization alone would not cancel the correlation between attributes.
Generalization is naturally used in group-based techniques, where it makes it possible to group more tuples under the generated categories and thus achieve anonymization.
As far as the protection levels are concerned, the Microdata category mainly targets event-level protection, as all users are protected equally through the performed grouping.
Still, scenarios that contain trajectories associated with a certain user aim to protect this user's privacy by blurring the actual trajectories (user-level).
$w$-event level is absent in the Microdata category; one reason may be that streaming scenarios are not prominent in this category, and another, practical, reason may be that this notion was introduced later in time.
Indeed, none of the works in the Microdata category explicitly mentions the level of privacy, as these levels have been introduced in differential privacy scenarios, hence in Statistical Data.
Considering all the use cases from both categories, event-level protection is more prominent, as it is more practical to protect all the users as a single set than each one individually in continual settings.
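As a reminder of how the three protection levels relate, the following is a sketch in our own notation (the symbols $\varepsilon_k$, $\varepsilon$, $w$, and $t$ are assumptions, not taken from the surveyed works).
Suppose that a mechanism publishes at every timestamp $k$ through an independent differentially private sub-mechanism with budget $\varepsilon_k$, with $\varepsilon_k = 0$ for $k < 1$.
A sufficient condition for $w$-event privacy with total budget $\varepsilon$ is then that the budget spent in every window of $w$ consecutive timestamps is bounded:
\[
  \sum_{k = t - w + 1}^{t} \varepsilon_k \leq \varepsilon
  \quad \mbox{for every timestamp } t .
\]
Under this reading, $w = 1$ corresponds to event-level protection, while a window that spans the whole stream corresponds to user-level protection.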
As already discussed, problems with stream processing are not common in the Microdata category.
Indeed, most of the cases including streaming scenarios are in the Statistical Data category.
A technical reason behind this observation is that anonymizing a raw data set as a whole may be a time-consuming process and thus not well suited for streaming.
The complexity actually depends on the number of attributes, if we consider the possible combinations that may be enumerated.
On the contrary, aggregation functions as used in the Statistical Data category, especially in the absence of filters, are usually low-cost.
Moreover, perturbing a numerical value (the usual result of an aggregation function) does not add much to the complexity of the algorithm (depending, of course, on the perturbation model used).
For this reason, perturbing the result of a process is more time-efficient than anonymizing the data set and then running the process on the anonymized data.
Still, we may argue that an anonymized data set can be more widely used, whereas in the case of statistical data only the data holder runs the processes and releases the results.