\section{Statistical data}
\label{sec:statistical}
When continuously publishing statistical data, usually in the form of counts, the most widely used privacy method is differential privacy, or derivatives of it, as witnessed in Table~\ref{tab:statistical}.
In theory, differential privacy makes no assumptions about the background knowledge available to the adversary.
In practice, as we observe in Table~\ref{tab:statistical}, data dependencies (e.g.,~correlations) arising in the continuous publication setting are frequently, though not always, considered as attacks by the proposed algorithms.
\includetable{statistical}
\subsection{Finite observation}
\label{subsec:statistical-finite}
% Practical differential privacy via grouping and smoothing
% - statistical (counts)
% their scenario is built on location data (check-ins)
% - finite
% - batch
% - linkage
% - event
% - differential privacy
% - perturbation (Laplace)
\hypertarget{kellaris2013practical}{Kellaris et al.}~\cite{kellaris2013practical} pointed out that in time series, where users might contribute to an arbitrary number of aggregates, the sensitivity of the query answering function is significantly influenced by their presence/absence in the data set.
Thus, the Laplace perturbation algorithm, commonly used with differential privacy, may produce meaningless data sets.
Furthermore, under such settings, the discrete Fourier transform of the Fourier perturbation algorithm (another popular technique for data perturbation) may behave erratically and affect the utility of the outcome of the mechanism.
For this reason, the authors proposed their own method involving grouping and smoothing for one-time publishing of time series of non-overlapping counts, i.e.,~the aggregated data of one count does not affect any other count.
Grouping includes partitioning the data set into similar clusters.
The size and the similarity measure of the clusters are data dependent.
Random grouping consumes less privacy budget, as there is minimum interaction with the original data.
However, when using a grouping technique based on sampling, which has some privacy cost but produces better groups, the impact of the perturbation is decreased.
During the smoothing phase, the average values for each cluster are calculated, and finally, Laplace noise is added to these values.
In this way, the query sensitivity becomes less dependent on each individual's data, and therefore less perturbation is required.
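The effect of smoothing on the sensitivity can be illustrated with a simplified sketch; the symbols $g$, $c_j$, and $\Delta$ are ours and not the authors' notation, and the budget spent on grouping is ignored.
Assume a group $g$ of counts $c_j$ such that the presence or absence of any single user changes the counts of the group by at most $\Delta$ in total.
Perturbing every count of the group individually would require Laplace noise with scale $\Delta/\varepsilon$ per count, whereas the group average changes by at most $\Delta/|g|$, so that
\[
\bar{c}_g = \frac{1}{|g|} \sum_{j \in g} c_j, \qquad \tilde{c}_g = \bar{c}_g + \mathrm{Lap}\!\left(\frac{\Delta}{|g|\,\varepsilon}\right).
\]
The noise scale thus shrinks by a factor of $|g|$, at the cost of the approximation error introduced by replacing each count with its group average.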
% Differentially private sequential data publication via variable-length n-grams
% - statistical (trajectories)
% - finite
% - batch
% - linkage
% - user
% - differential privacy
% - perturbation (adaptive Laplace)
\hypertarget{chen2012differentially}{Chen et al.}~\cite{chen2012differentially} exploit a text-processing technique, the \emph{n-gram} model, i.e.,~a contiguous sequence of $n$ items from a given data sample, to release sequential data without releasing the noisy statistics (counts) of all of the possible sequences.
This model allows the publishing of the most common $n$-grams ($n$ is, typically, less than $5$) to accurately reconstruct the original data set.
The privacy technique that the authors propose is suitable for count queries and frequent sequential pattern mining scenarios.
In particular, one of the applications that the authors consider concerns sequential spatiotemporal data (i.e.,~trajectories) of individuals.
They group grams based on the similarity of their $n$ values, construct a search tree, and inject Laplace noise into each node value (count) to achieve user-level differential privacy protection.
Instead of allocating the available privacy budget based on the overall maximum height of the tree, they estimate each path adaptively based on known noisy counts.
The grouping process continues until the desired threshold of $n$ is reached.
Thereafter, they release variable-length $n$-grams with certain thresholds for the values of counts and tree heights, which lets them balance the trade-off that shorter grams carry less information than longer ones but incur lower relative error.
They use a set of consistency constraints, i.e.,~the sum of the noisy counts of a node's children has to be less than or equal to the node's own noisy count, and all the noisy counts of leaf nodes have to be within a predefined threshold.
These constraints improve the final data utility since they result in lower values of $n$.
On the one hand, this translates into higher counts, large enough to deal with noise injection and the inherent Markov assumption in the $n$-gram model.
On the other hand, it enhances privacy when the universe of all grams with a lower $n$ value is relatively small, resulting in more common sequences, which, nonetheless, is rarely the case in real-life scenarios.
% Differentially private publication of general time-serial trajectory data
% - statistical (trajectories)
% - finite
% - batch
% - linkage
% - user
% - differential privacy
% - perturbation (exponential, Laplace)
\hypertarget{hua2015differentially}{Hua et al.}~\cite{hua2015differentially}, similarly to the scheme proposed in~\cite{chen2012differentially}, use the $n$-gram modeling technique to publish trajectories that contain a small number of $n$-grams and thus share few or even no identical prefixes.
They propose a differentially private location-specific generalization algorithm (exponential mechanism), where each position in the trajectory is one record.
The algorithm probabilistically partitions the locations at each timestamp with probability proportional to their Euclidean distance from each other.
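For reference, the exponential mechanism in its generic form selects an output $o$ from a set of candidates $\mathcal{O}$ with probability
\[
\Pr[o] = \frac{\exp\left(\varepsilon\, u(D,o) / (2 \Delta u)\right)}{\sum_{o' \in \mathcal{O}} \exp\left(\varepsilon\, u(D,o') / (2 \Delta u)\right)},
\]
where $u$ is a utility function over the data set $D$ and the candidates, and $\Delta u$ is its sensitivity.
The exact utility function of the authors is not reproduced here; one plausible instantiation (our assumption) scores a candidate partition by how close, in terms of Euclidean distance, the locations that it groups together are, so that partitions of nearby locations are exponentially more likely to be selected.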
They replace each partition with its centroid and, therefore, offer better utility by creating groups of locations belonging to close trajectories.
They optimize the algorithm for time efficiency by using classic $k$-means clustering.
Then, the algorithm releases the new trajectories by observing the generalized location partitions, and their perturbed counts (i.e.,~sum of the same locations at each timestamp) with noise drawn from a Laplace distribution.
The process continues until the total count of the published trajectories reaches the size of the original data set.
They can limit the total number of the possible trajectories by taking into account the individual's moving speed.
The authors evaluated the utility of distorted spatiotemporal range queries using the Hausdorff distance from the original results and concluded that the utility deterioration is within reasonable boundaries considering the offered privacy guarantees.
Similar to~\cite{chen2012differentially}, their approach works well for a small location domain.
To make it applicable to realistic scenarios, it is essential to truncate the original trajectories in an effort to reduce the location domain.
This results in a coarse discretization of the location area, leading to the arbitrary distortion of the spatial correlations that are present in the original data set.
% Achieving differential privacy of trajectory data publishing in participatory sensing
% - statistical (trajectories)
% - finite
% - batch
% - linkage
% - user
% - differential privacy
% - perturbation (Laplace)
\hypertarget{li2017achieving}{Li et al.}~\cite{li2017achieving} focus on publishing a set of trajectories, where, contrary to~\cite{hua2015differentially}, each one is considered as a single entry in the data set.
First, using $k$-means clustering they partition the original locations based on their pairwise Euclidean distances.
The scheme represents each location partition by its mean (centroid).
A larger number of partitions, in areas where close centroids exist, results in fewer locations in each partition, and thus lower trajectory precision loss.
Before adding noise, they randomly select partition centroids to generate trajectories until they reach the size of the original data set.
Then, they generate Laplace noise, which they bound according to a set of constraints, and they add it to the count of locations of each point of every trajectory.
Finally, they release the generalized trajectories along with the noisy count of each location point.
The authors show experimentally that they considerably reduce the trajectory merging time at the expense of utility.
% DPT: differentially private trajectory synthesis using hierarchical reference systems
% - statistical (trajectories)
% - finite
% - batch
% - dependence
% - user
% - differential privacy
% - perturbation (Laplace)
% - spatial correlations (Hierarchical Reference Systems (HRS))
\hypertarget{he2015dpt}{He et al.} present \emph{DPT} (Differentially Private Trajectory)~\cite{he2015dpt}, a system that synthesizes mobility data based on raw, speed-varying trajectories of individuals, while providing $\varepsilon$-differential privacy protection guarantees.
The system constructs a Hierarchical Reference Systems (HRS) model to capture correlations between adjacent locations by imposing a uniform grid at multiple resolutions (i.e.,~for different speed values) over the space, keeping a prefix tree for each resolution, and choosing the centroids as anchor points.
In each reference system, anchor points have a small number of neighboring points with increasing (by a constant factor) average distance between them, and fewer children anchor points as the grid resolution becomes finer.
DPT estimates transition probabilities only for the anchor points in proximity to the last observed location, and chooses the appropriate reference system for each raw point so that the consecutive points of the trajectory are either neighboring anchors or have a parent-child relationship.
The system generates the transition probabilities by estimating the counts in the prefix trees.
Thereafter, it chooses the appropriate prefix trees, perturbs them with noise drawn from the Laplace distribution, and adaptively prunes subtrees with low counts to improve the resulting utility.
DPT implements a direction-weighted sampling postprocessing strategy for the synthetic trajectories to avoid the loss of directionality of the original trajectories due to the perturbation.
Nonetheless, as with all other similar techniques, the usage of prefix trees limits the length of the released trajectories, which results in an uneven spatial distribution.
% Pufferfish Privacy Mechanisms for Correlated Data
% - statistical
% - finite
% - batch
% - dependence
% - unspecified
% - \emph{Pufferfish}
% - perturbation (Laplace)
% - general (Bayesian networks/Markov chains)
\hypertarget{song2017pufferfish}{Song et al.}~\cite{song2017pufferfish} propose the \emph{Wasserstein mechanism}, a technique that applies to any general instantiation of Pufferfish (see Section~\ref{subsec:prv-statistical}).
It adds noise proportional to the sensitivity of a query $F$, which depends on the worst case distance between the distributions $P(F(X)|s_i,d)$ and $P(F(X)|s_j,d)$ for a variable $X$, a pair of secrets $(s_i,s_j)$, and an evolution scenario $d$.
The Wasserstein metric function calculates the worst case distance between those two distributions.
The noise is drawn from a Laplace distribution with scale equal to the maximum Wasserstein distance over the distributions of all the pairs of secrets, divided by the available privacy budget $\varepsilon$.
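In symbols, and writing $\mu_{i,d} = P(F(X)|s_i,d)$ as a shorthand of ours, the description above amounts to
\[
W = \sup_{(s_i,s_j),\,d} W_\infty\left(\mu_{i,d}, \mu_{j,d}\right), \qquad M(X) = F(X) + \mathrm{Lap}\left(W / \varepsilon\right),
\]
where $W_\infty$ denotes the $\infty$-Wasserstein distance, $M$ (our notation) is the resulting mechanism, and the supremum ranges over all the pairs of secrets and all the evolution scenarios under consideration.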
For optimization purposes, the authors consider a more restricted setting.
This setting utilizes an evolution scenario to represent the data correlations, and Bayesian networks to model them.
The authors state that in cases where Bayesian networks are complex, Markov chains are a more efficient alternative.
A generalization of the \emph{Markov blanket} mechanism, the \emph{Markov quilt} mechanism, calculates data dependencies.
The dependent nodes of any node consist of its parents, its children, and the other parents of its children.
The present technique excels at data sets generated by monitoring applications or networks, but it is not suitable for online scenarios.
% Differentially private multi-dimensional time series release for traffic monitoring
% - statistical (location)
% - finite
% - streaming
% - dependence
% - user
% - differential privacy
% - perturbation (Laplace)
% - spatiotemporal/serial correlations
\hypertarget{fan2013differentially}{Fan et al.}~\cite{fan2013differentially} propose a real-time framework for releasing differentially private multi-dimensional traffic monitoring data.
At every timestamp, the Perturbation module injects noise drawn from a Laplace distribution to the data.
Then, the Estimation module post-processes the perturbed data to improve the accuracy.
The authors propose a temporal and a spatial estimation algorithm.
The former estimates an internal time series model for each location to improve the utility of the perturbation's outcome by performing a posterior estimation that utilizes Gaussian approximation and Kalman filtering~\cite{kalman1960new}.
The latter reduces data sparsity by grouping neighboring locations using a spatial indexing structure based on a quadtree.
The Modeling/Aggregation module utilizes domain knowledge, e.g.,~road network and density, and has a bidirectional interaction with the other two in parallel.
Although the authors propose the framework for real-time scenarios, they do not deal with infinite data processing/publication, which considerably limits its applicability.
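The temporal estimation step can be illustrated with a scalar Kalman filter.
The following is a generic sketch under a constant process model, not necessarily the exact formulation of the authors; $Q$ and $R$ denote the (assumed) variances of the process noise and of the Gaussian approximation of the perturbation noise, $z_k$ is the perturbed value at timestamp $k$, and $\hat{x}_k$, $P_k$ are the posterior estimate and its variance.
\[
\begin{array}{ll}
\hat{x}_k^- = \hat{x}_{k-1}, & P_k^- = P_{k-1} + Q, \\
K_k = P_k^- / (P_k^- + R), & \\
\hat{x}_k = \hat{x}_k^- + K_k \left(z_k - \hat{x}_k^-\right), & P_k = (1 - K_k)\, P_k^-
\end{array}
\]
A large $R$ relative to $P_k^-$ yields a small gain $K_k$, so the posterior estimate leans on the prediction rather than on the noisy observation.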
% An Adaptive Approach to Real-Time Aggregate Monitoring With Differential Privacy
% - statistical
% - finite
% - streaming
% - linkage
% - user
% - differential privacy
% - perturbation (dynamic Laplace)
In another work, \hypertarget{fan2014adaptive}{Fan et al.} designed \emph{FAST}~\cite{fan2014adaptive}, an adaptive system that allows the release of real-time aggregate time series under user-level differential privacy.
The system consists of a Sampling, a Perturbation, and a Filtering module.
The Sampling module samples, at an adaptive rate, the aggregates to be perturbed.
The Perturbation module adds noise to each sampled point according to the allocated privacy budget.
The Filtering module receives the perturbed data point and the original one and generates a posterior estimate, which is finally released.
The error between the perturbed and the released (posterior estimate) point is used to adapt the sampling rate; the sampling frequency is increased when data is going through rapid changes and vice versa.
Thus, depending on the adjusted sampling rate, not every single data point is perturbed, saving in this way the available privacy budget.
While the system considers the temporal correlations of the processed time series, it does not attempt to deal with the privacy threat that they might pose.
% CTS-DP: publishing correlated time-series data via differential privacy}
% - statistical (they use trajectories in the experiments)
% - finite
% - streaming
% - dependence
% - event
% - differential privacy
% - perturbation (correlated Laplace)
% - serial correlations (autocorrelation function)
\hypertarget{wang2017cts}{Wang and Xu}~\cite{wang2017cts} defined Correlated Time Series Differential Privacy (\emph{CTS-DP}).
The scheme guarantees that the correlation between the perturbation that is introduced by a Correlated Laplace Mechanism (CLM), and the original time series is indistinguishable (Series-Indistinguishability).
CTS-DP deals with the shortcomings of independent and identically distributed (i.i.d.) noise under the presence of correlations.
I.i.d. noise offers inadequate protection, because refinement methods, e.g.,~filtering, can remove it.
Most privacy-preserving methods choose to introduce more noise in the presence of strong correlations, thus diminishing the data utility.
An original and a perturbed time series satisfy Series-Indistinguishability if their normalized autocorrelation functions are the same; hence, the two time series are indistinguishable and the published time series satisfies differential privacy as well.
The authors consider the fact that, in signal processing, if an i.i.d. signal passes through a filter, which consists of a combination of adders and delayers, it becomes non-i.i.d.
Hence, they design CLM, which uses four Gaussian white noise series passed through a linear system, to produce a correlated Laplace noise series according to the autocorrelation function of the original time series.
Although the authors show experimentally that the implementation of CLM outperforms the current state-of-the-art methods, they do not test its robustness against any filter, which they leave as future work.
\subsection{Infinite observation}
\label{subsec:statistical-infinite}
% Private and continual release of statistics
% - statistical
% - infinite
% - streaming
% - linkage
% - event
% - differential privacy
% - perturbation (Laplace)
\hypertarget{chan2011private}{Chan et al.}~\cite{chan2011private} designed continuous counting mechanisms for finite and infinite data processing and publishing, satisfying $\varepsilon$-differential privacy.
Their main contribution lies in proposing the Binary and Hybrid mechanisms, which do not require an upper bound on the number of timestamps.
The mechanisms rely on the release of intermediate partial sums of counts at consecutive timestamp intervals, called \emph{p-sums}, and the injection of noise drawn from a Laplace distribution.
The Binary mechanism constructs a binary tree where each node corresponds to a p-sum, and adds noise to each released p-sum proportional to its corresponding length.
The Hybrid mechanism publishes counts at sparse time intervals, i.e.,~timestamps that are a power of $2$.
Both mechanisms offer event-level protection (pan-privacy) under single unannounced and continual announced intrusions by adding a certain amount of noise to every p-sum in memory.
They can facilitate continual top-$k$ queries in recommendation systems, and multidimensional range queries.
Furthermore, they are able to support applications that require a consistent output, i.e.,~at each timestamp the counter increases by either $0$ or $1$.
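As a worked example of the p-sum decomposition (with $\sigma_{[a,b]}$ as our shorthand for the p-sum of the items at timestamps $a$ through $b$), the count at timestamp $7$ can be assembled from three nodes of the binary tree:
\[
c(7) = \sigma_{[1,4]} + \sigma_{[5,6]} + \sigma_{[7,7]}.
\]
In general, the number of p-sums needed at timestamp $t$ equals the number of ones in the binary representation of $t$ ($7 = 111_2$ above), i.e.,~at most $\lfloor \log_2 t \rfloor + 1$, and each item contributes to at most one p-sum per tree level.
Both quantities grow only logarithmically with time, which keeps the accumulated noise low compared to perturbing every item separately and summing the results.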
% Differentially private real-time data release over infinite trajectory streams
% - statistical (spatial)
% - infinite
% - streaming
% - linkage
% - personalized w-event
% - differential privacy
% - perturbation (dynamic Laplace)
\hypertarget{cao2015differentially}{Cao et al.}~\cite{cao2015differentially} developed a framework that achieves personalized \emph{l-trajectory} privacy protection by dynamically adding noise at each timestamp, which exponentially fades over time.
Each individual can specify, in an array of size $l$, the desired protection level for each location of his/her trajectory.
The proposed framework is composed of three components.
The Dynamic Budget Allocation component allocates portions of the privacy budget to the other two components: a fixed one to the Private Approximation, and a dynamic one to the Private Publishing component at each timestamp.
The Private Approximation component estimates, under a utility goal and an approximation strategy, whether it is beneficial to publish approximate data or not.
More precisely, it chooses an appropriate previous noisy data release and republishes it if it is similar to the real statistics planned to be published.
The Private Publishing component takes as inputs the real statistics, and the timestamp of the approximate data, generated by the Private Approximation component, to be republished.
If the timestamp of the approximate data is equal to the current timestamp, then the current data with Laplace noise are published.
Otherwise, the data at the corresponding timestamp of the approximate data will be republished.
The approximation technique is well suited to stream processing, since republishing previous releases can significantly reduce the privacy budget consumption.
However, the framework does not take into account privacy leakage stemming from data dependencies, which considerably limits its applicability to real-life data sets.
% Private decayed predicate sums on streams
% - statistical
% - infinite
% - streaming
% - linkage
% - w-event
% - differential privacy
% - perturbation (Laplace)
\hypertarget{bolot2013private}{Bolot et al.}~\cite{bolot2013private} introduce the notion of \emph{decayed privacy} in continual observation of aggregates (sums).
The authors recognize the fact that monitoring applications focus more on recent events and data; therefore, the value of previous data releases fades exponentially.
This leads to a schema of privacy with expiration, according to which recent events and data are more privacy sensitive than older ones.
Based on this, they apply decayed sum functions for answering sliding window queries of fixed window size $w$ on data streams.
Namely, window sums compute the difference of two running sums, and exponentially and polynomially decayed sums estimate the sum of decayed data.
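Written out for a stream $x_1, x_2, \dots$ observed up to timestamp $t$, these sums commonly take the following forms; the decay parameters $\alpha \in (0,1)$ and $c > 0$ are our notation, and the authors' exact definitions may differ in details.
\[
\begin{array}{ll}
\mbox{window sum:} & \sum_{i=t-w+1}^{t} x_i = S_t - S_{t-w}, \quad S_t = \sum_{i=1}^{t} x_i \quad (t \geq w,\ S_0 = 0), \\
\mbox{exponentially decayed sum:} & \sum_{i=1}^{t} \alpha^{t-i} x_i, \\
\mbox{polynomially decayed sum:} & \sum_{i=1}^{t} (t-i+1)^{-c}\, x_i
\end{array}
\]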
For every consecutive $w$ data points the algorithm generates binary trees where each node is perturbed with Laplace noise with scale proportional to $w$.
Instead of maintaining a binary tree for every window, the algorithm considers the windows that span two blocks as the union of a suffix and a prefix of two consecutive trees.
This way, the global sensitivity of the query function is kept low.
The proposed techniques are designed for fixed window sizes, hence, when answering multiple sliding window queries with variable window sizes they have to distribute the available privacy budget accordingly.
% Differentially private event sequences over infinite streams
% - statistical
% - infinite
% - streaming
% - linkage
% - w-event
% - differential privacy
% - perturbation (Laplace)
Based on the notion of decayed privacy~\cite{bolot2013private}, \hypertarget{kellaris2014differentially}{Kellaris et al.}~\cite{kellaris2014differentially} defined $w$-event privacy in the setting of periodical release of statistics (counts) in infinite streams.
To achieve $w$-event privacy, the authors propose two mechanisms (Budget Distribution, and Budget Absorption) based on sliding windows, which effectively distribute the privacy budget to sub-mechanisms (one sub-mechanism per timestamp) applied on the data of a window of the stream.
Both algorithms may decide to publish a new noisy count for a specific timestamp, based on the similarity level of the current count with a previously published one.
Moreover, both algorithms have the constraint that the total privacy budget consumed in a window is less than or equal to $\varepsilon$.
The Budget Distribution algorithm distributes the privacy budget in an exponential-fading manner following the assumption that in a window most of the counts remain similar.
The budget of expired timestamps becomes available for the next publications (of next windows).
The Budget Absorption algorithm uniformly distributes from the beginning the budget to the window's timestamps.
A publication uses not only the by-default allocated budget but also the budget of non-published timestamps.
In order not to exceed the limit of $\varepsilon$, an adequate number of subsequent timestamps are `silenced' after a publication takes place.
Even though one can argue that $w$-event privacy could be achieved by user-level privacy, this is nevertheless impractical because of the rigidity of the budget allocation that would finally render the output useless.
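The budget constraint shared by the two mechanisms can be stated as follows, where $\varepsilon_k$ (our notation) is the budget spent by the sub-mechanism at timestamp $k$, with $\varepsilon_k = 0$ for $k \leq 0$:
\[
\sum_{k=i-w+1}^{i} \varepsilon_k \leq \varepsilon \quad \mbox{for every timestamp } i.
\]
As a simplified illustration of exponential fading, suppose that each publication inside a window spends half of the budget that is still available to that window.
Successive publications then spend $\varepsilon/2, \varepsilon/4, \varepsilon/8, \dots$, and since $\sum_{j \geq 1} \varepsilon/2^j = \varepsilon$, the constraint holds no matter how many publications occur in the window.
This illustration ignores any budget reserved for the similarity test.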
% RescueDP: Real-time spatio-temporal crowd-sourced data publishing with differential privacy
% - statistical (spatial)
% - infinite
% - streaming
% - linkage
% - w-event
% - differential privacy
% - perturbation (dynamic Laplace)
% - serial correlations (Pearson's r)
\hypertarget{wang2016rescuedp}{Wang et al.}~\cite{wang2016rescuedp} propose \emph{RescueDP} for the publishing of real-time user-generated spatiotemporal data, utilizing differential privacy with $w$-event-level protection.
RescueDP uses a Dynamic Grouping module to create clusters of regions with small statistics, i.e.,~areas with a small number of samples.
It estimates the similarity of the data trends of these regions by utilizing Pearson's correlation coefficient, and creates groups accordingly.
The data of each group pass through a Perturbation module that injects Laplace noise into them.
The grouping of the previous phase results in an increase of the sample size of each group of regions, which minimizes the error due to the noise injection.
The implementation of a Kalman Filtering~\cite{kalman1960new} module further increases the utility of the released data.
A Budget Allocation module distributes the available privacy budget to sampling points within any successive $w$ timestamps.
RescueDP saves part of the available privacy budget by approximating the non-sampled data with previously released perturbed data.
During the whole process, an Adaptive Sampling module adjusts the sampling interval according to the difference in the released data statistics over the previous timestamps while taking into account the remaining privacy budget.
% RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response
% - statistical
% - infinite
% - streaming
% - linkage
% - user
% - differential privacy
% - randomization (randomized response)
% - local
\hypertarget{erlingsson2014rappor}{Erlingsson et al.}~\cite{erlingsson2014rappor} presented \emph{RAPPOR} (Randomized Aggregatable Privacy-Preserving Ordinal Response) as a solution for privacy-preserving collection of statistics.
RAPPOR performs all the necessary data processing on the side of the data generators by applying the method of randomized response, which guarantees local differential privacy.
The product of each local privacy-preserving processing is a report that can be represented as a bit string.
Each bit corresponds to a randomized response to a logical predicate on an individual's personal data, e.g.,~categorical properties, numerical and ordinal values, or categories that cannot be enumerated.
Initially, RAPPOR hashes a sensitive value into a Bloom filter~\cite{bloom1970space}.
It creates a binary reporting value, which it keeps in memory (\emph{memoization}) and reuses for future reports (permanent randomized response).
Memoization offers long-term longitudinal privacy protection for privacy-sensitive data values that do not change over time or that are not dependent.
RAPPOR deals with tracking externalities by reporting a randomized version of the permanent randomized response (instantaneous randomized response).
Although this adds an extra layer of randomization to the reported values, it might lead to an averaging attack that may allow an adversary to estimate the true value.
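The two randomization steps can be written down per Bloom filter bit.
Following our reading of the notation of the original paper, with $B_i$ the $i$-th bit of the Bloom filter and $f, p, q \in [0,1]$ tunable parameters, the permanent randomized response $B'_i$ and the instantaneous randomized response $S_i$ are drawn as
\[
B'_i = \left\{ \begin{array}{ll} 1 & \mbox{with probability } f/2, \\ 0 & \mbox{with probability } f/2, \\ B_i & \mbox{with probability } 1 - f, \end{array} \right.
\qquad
\Pr[S_i = 1] = \left\{ \begin{array}{ll} q & \mbox{if } B'_i = 1, \\ p & \mbox{if } B'_i = 0. \end{array} \right.
\]
Averaging many instantaneous reports lets an observer estimate the memoized bit $B'_i$; how much this reveals about $B_i$ is then governed by $f$.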
Finally, the authors propose a decoding technique that involves grouping, least-squares solving, and regression.
This way, they effectively make up for the loss of information due to the randomization of the previous steps and allow the extraction of useful information when observing the generated bit strings.
They test their implementation with both simulated and real data, and show that they can extract statistics with high utility while preserving the privacy of the individuals involved.
However, the fact that the privacy guarantees of their technique are valid only for stationary individuals that produce independent data, together with the relatively complex configuration, renders their proposal impractical for many real-world scenarios.
% PrivApprox: privacy-preserving stream analytics
% - statistical
% - infinite
% - streaming
% - linkage
% - event
% - zero-knowledge
% - perturbation (randomized response)
\hypertarget{quoc2017privapprox}{Le Quoc et al.}~\cite{quoc2017privapprox} propose \emph{PrivApprox}, a data analytics system for privacy-preserving stream processing of distributed data sets that combines sampling and randomized response.
The system distributes the analysts' queries to clients via an aggregator and proxies, and employs sliding window computations over batched stream processing to handle the data stream generated by the clients.
The clients transmit a randomized response, after sampling the locally available data, to the aggregator via proxies that apply (XOR-based) encryption.
The combination of sampling and randomized response achieves \emph{zero-knowledge} based privacy, i.e.,~proving that they know a piece of information without in fact disclosing its actual value.
The aggregator collects the received responses and returns statistics to the analysts.
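One way to see how an aggregator can recover statistics from randomized answers is the classic two-coin randomized response, sketched here under our own notation rather than as the exact procedure of the system.
Suppose each of $N$ sampled clients answers a yes/no query truthfully with probability $p$, and otherwise answers `yes' with probability $q$.
If $A_y$ clients truly hold `yes' and $R_y$ `yes' answers are received, then
\[
\mathrm{E}[R_y] = p A_y + (1 - p)\, q N \quad \Longrightarrow \quad \hat{A}_y = \frac{R_y - (1 - p)\, q N}{p}.
\]
For instance, with $N = 1000$, $p = 0.5$, $q = 0.5$, and $R_y = 400$, the estimate is $\hat{A}_y = (400 - 250)/0.5 = 300$.
The estimate refers to the sampled clients only; extrapolating to the whole population additionally requires scaling by the inverse of the sampling rate.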
The query model expresses the responses of numerical queries as counts within histogram buckets, whereas, for non-numeric queries it specifies each bucket by a matching rule or a regular expression.
A confidence metric quantifies the approximation error that sampling and randomization introduce into the results.
PrivApprox achieves low latency stream processing and enables a synchronization-free distributed architecture that requires little trust in a central entity.
Since it implements a sliding window methodology for processing infinite series of data sets, it would be worthwhile to investigate how to achieve $w$-event-level privacy protection.
% Hiding in the crowd: Privacy preservation on evolving streams through correlation tracking
% - statistical
% - infinite
% - streaming
% - data dependence
% - event
% - randomization
% - perturbation (dynamic)
% - serial correlations (data trends)
\hypertarget{li2007hiding}{Li et al.}~\cite{li2007hiding} attempt to tackle the problem of privacy preservation in numerical data streams taking into account the correlations that may appear continuously among multiple streams and within each one of them.
First, the authors define the utility and privacy specifications.
The utility of a perturbed data stream is the inverse of the discrepancy between the original and the perturbed measurements.
The discrepancy is set as the normalized Frobenius norm, i.e.,~a matrix norm defined as the square root of the sum of the absolute squares of its elements.
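For an $m \times n$ matrix $A$ with entries $a_{ij}$, the (unnormalized) Frobenius norm reads
\[
\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2},
\]
so that the discrepancy between the original measurements $A$ and the perturbed ones $\tilde{A}$ (our notation) is a function of $\|A - \tilde{A}\|_F$; the exact normalization factor used by the authors is not reproduced here.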
Privacy corresponds to the discrepancy between the original and the reconstructed data stream (from the perturbed one), and consists of the removed noise and the error introduced by the reconstruction.
Then, correlations come into play.
The system continuously monitors the data streams for trends to track correlations, and dynamically perturbs the original numerical data while maintaining the trends that are present.
More specifically, the Streaming Correlated Additive Noise (SCAN) module updates the estimation of the local principal components of the original data, and proportionally distributes noise along the components. Thereafter, the Streaming Correlation Online Reconstruction (SCOR) module removes all the noise by utilizing the best linear reconstruction.
SCOR is a representation of the ability of any adversarial entity to post-process the released data and attempt to reconstruct the original data set by filtering out any distortion.
Overall, the present technique offers robustness against inference attacks by adapting randomization according to data trends, but fails to efficiently quantify the overall privacy guarantee.
% PeGaSus: Data-Adaptive Differentially Private Stream Processing
% - statistical
% - infinite
% - streaming
% - linkage
% - event
% - differential privacy
% - perturbation (Laplace)
\hypertarget{chen2017pegasus}{Chen et al.}~\cite{chen2017pegasus} developed \emph{PeGaSus}, an algorithm for event-level differentially private stream processing that supports different categories of stream queries (counts, sliding window, and event monitoring) over multiple stream resolutions.
It consists of a Perturber, a Grouper, and a Smoother module.
The Perturber consumes the incoming data stream, adds noise to each data item using a portion $\varepsilon_p$ of the available privacy budget $\varepsilon$, and outputs a stream of noisy data.
The data-adaptive Grouper consumes the original stream and partitions the data into well-approximated regions using another portion, $\varepsilon_g$, of the available privacy budget.
Finally, a query specific Smoother combines the independent information produced by the Perturber and the Grouper, and performs post-processing by calculating the final estimates of the Perturber's values for each partition created by the Grouper at each timestamp.
The combination of the Perturber and the Grouper follows the sequential composition and post-processing properties of differential privacy; thus, the resulting algorithm satisfies ($\varepsilon_p + \varepsilon_g$)-differential privacy.
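As an illustration with numbers of our own choosing, let $\varepsilon = 1$ be split into $\varepsilon_p = 0.8$ and $\varepsilon_g = 0.2$.
Under event-level privacy a single event changes one count $c_t$ by at most $1$, so a Laplace-based Perturber can release
\[
\tilde{c}_t = c_t + \mathrm{Lap}\left(1/\varepsilon_p\right), \qquad \mathrm{Var}\left[\tilde{c}_t\right] = 2/\varepsilon_p^2 \approx 3.1.
\]
If the Smoother is, e.g.,~a simple average over a partition $G$ of timestamps (one possible choice, assumed here), then $\hat{c}_t = \frac{1}{|G|} \sum_{k \in G} \tilde{c}_k$ has noise variance $2/(|G|\,\varepsilon_p^2)$, and, being post-processing, it consumes no additional privacy budget.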