
\section{Statistical data}
\label{sec:statistical}

When continuously publishing statistical data, usually in the form of counts, the most widely used privacy method is differential privacy, or derivatives of it, as witnessed in Table~\ref{tab:related}. We now review the works in this category.


% \subsection{Continual data}

% \mk{Nothing to put here.}


\subsection{Data streams}

% Private and continual release of statistics

\hypertarget{chan2011private}{Chan et al.}~\cite{chan2011private} designed a continual counting mechanism satisfying $\varepsilon$-differential privacy with poly-log error. A binary tree is constructed, where each node contains a sum of the counts in its subtree, including noise. It can be used for continual top-k queries in recommendation systems and multidimensional range queries. The mechanism provides guarantees for indefinite runtime without a priori knowledge of an upper temporal bound. It can preserve differential privacy (\emph{pan privacy}) under single or multiple unannounced \emph{intrusions}, i.e.,~snapshots of the mechanism's internal states, by adding a certain amount of noise to each active counter in memory, without incurring any loss in the asymptotic guarantees. The output of the mechanism at every timestamp is a \emph{consistent} approximate integer count, i.e.,~at each time step it increases by either 0 or 1. This makes the mechanism computationally inefficient and not easily applicable in real life scenarios.
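
To give a rough sense of where the poly-log error comes from, consider a simplified variant with a known horizon of $T$ timestamps (an assumption made here only for illustration; the notation $L$, $N(t)$, and $\eta_v$ is ours). Each item contributes to at most $L = \lfloor \log_2 T \rfloor + 1$ tree nodes, so adding independent Laplace noise of scale $L/\varepsilon$ to every node suffices for $\varepsilon$-differential privacy, and each running count is the sum of at most $L$ noisy nodes:
\[
\tilde{c}_t = \sum_{v \in N(t)} \left( c_v + \eta_v \right), \qquad \eta_v \sim \mathrm{Lap}(L/\varepsilon), \qquad |N(t)| \leq L ,
\]
where $N(t)$ is the set of nodes whose intervals partition $[1,t]$. Since a Laplace variable of scale $b$ has variance $2b^2$, the variance of $\tilde{c}_t$ is at most $2L^3/\varepsilon^2$, i.e.,~a standard deviation of order $(\log T)^{1.5}/\varepsilon$.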


% Differentially private real-time data release over infinite trajectory streams

\hypertarget{cao2015differentially}{Cao et al.}~\cite{cao2015differentially} developed a framework that achieves \emph{$l$-trajectory} protection and enables personalized user privacy, while dynamically adding noise at each timestamp that exponentially fades over time. Users can specify, in an array of size $l$, the desired protection level for each location of their trajectory. The proposed framework is composed of three components. As its name indicates, the \emph{Dynamic Budget Allocation} component allocates portions of the privacy budget to the other two components: a fixed one to the \emph{Private Approximation} component, and a dynamic one to the \emph{Private Publishing} component at each timestamp.
The \emph{Private Approximation} component estimates, under a utility goal and an approximation strategy, whether it is beneficial to publish approximate data or not. It chooses an appropriate previous noisy data release and republishes it, if it is similar to the real statistics planned to be published. The \emph{Private Publishing} component takes the real statistics and the timestamp of the approximate data as inputs, and releases noisy data using a differential privacy mechanism that adds Laplace noise. If the timestamp of the approximate data is equal to the current timestamp, then the current data with Laplace noise are published. Otherwise, the noisy data at the timestamp of the approximate data are republished. The utilized approximation technique is highly suitable for stream processing and can significantly reduce the privacy budget consumption. However, the framework does not take into account privacy leakage stemming from data correlations, a fact that considerably limits its applicability in real life.
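
The publishing rule described above can be summarized as follows. This is only a sketch in notation introduced here for illustration: $c_t$ denotes the real statistics at timestamp $t$, $r_t$ the released value, $t'$ the timestamp of the approximate data, $\Delta$ the query sensitivity, and $\varepsilon_t$ the budget dynamically allocated to publishing at $t$.
\[
r_t = \left\{
\begin{array}{ll}
c_t + \mathrm{Lap}(\Delta/\varepsilon_t) & \mbox{if } t' = t, \\
r_{t'} & \mbox{if } t' < t.
\end{array}
\right.
\]
In the second case no publishing budget is spent at $t$, which is where the savings in budget consumption come from.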


% Private decayed predicate sums on streams

\hypertarget{bolot2013private}{Bolot et al.}~\cite{bolot2013private} introduce the notion of \emph{decayed privacy} in continual observation of aggregates (sums). The authors recognize that monitoring applications focus more on recent events and data and that, therefore, the value of previous data releases fades exponentially. This leads to a schema of \emph{privacy with expiration}, according to which recent events and data are more privacy sensitive than those preceding them. Based on this, they apply \emph{decayed sum} functions for answering sliding window queries of fixed window size $w$ on data streams, namely, (i) the \emph{window} sum, which can be reduced to computing the difference of two running sums, and (ii) the \emph{exponentially decayed} and (iii) the \emph{polynomially decayed} sums, which estimate the sum of decayed data. For every $w$ consecutive data points, binary trees are generated, where each node is perturbed by injecting Laplace noise with scale proportional to $w$. Instead of maintaining a binary tree for every window, the windows that span two blocks are viewed as the union of a suffix and a prefix of two consecutive trees. The proposed techniques are designed for fixed window sizes; hence, the available privacy budget must be split for answering multiple sliding window queries with various window sizes.
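
For concreteness, the three decayed sums can be written in their usual textbook forms. The symbols $x_i$ (stream item at time $i$), $S_t$ (running sum), $\alpha$, and $c$ are notation assumed here, and the exact parametrization used by the authors may differ:
\[
W_t = \sum_{i=t-w+1}^{t} x_i = S_t - S_{t-w}, \qquad S_t = \sum_{i=1}^{t} x_i ,
\]
\[
E_t = \sum_{i=1}^{t} \alpha^{\,t-i} x_i \quad (0 < \alpha < 1), \qquad
P_t = \sum_{i=1}^{t} (t-i+1)^{-c}\, x_i \quad (c > 0).
\]
The first identity is the reduction of the window sum to the difference of two running sums mentioned above.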


% PrivApprox: privacy-preserving stream analytics

\hypertarget{quoc2017privapprox}{Le Quoc et al.}~\cite{quoc2017privapprox} propose \emph{PrivApprox}, a data analytics system for privacy-preserving stream processing of distributed data sets that combines sampling and randomized response. Analysts' queries are distributed to clients via an aggregator and proxies. The clients sample their locally available data and transmit a randomized response to the aggregator via proxies that apply (XOR-based) encryption. The combination of sampling and randomized response achieves \emph{zero-knowledge} based privacy, i.e.,~the clients prove that they know a piece of information without disclosing its actual value. The aggregator aggregates the received responses and returns statistics to the analysts. For numerical queries, responses are expressed as counts within histogram buckets, whereas for non-numeric queries each bucket is specified by a matching rule or a regular expression. A confidence metric quantifies the approximation of the results that stems from the sampling and randomization. The system employs sliding window computations over batched stream processing to handle the data stream generated by the clients. \emph{PrivApprox} achieves low-latency stream processing and enables a synchronization-free distributed architecture that requires low trust in a central entity. However, the assumption that released data sets are independent is rarely true in real-life scenarios.
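
To illustrate how an aggregator can recover statistics from randomized responses, consider the classic two-coin scheme for a yes/no query (a sketch under assumed parameters, not necessarily the exact configuration of \emph{PrivApprox}): each client answers truthfully with probability $p$, and otherwise answers `yes' with probability $q$. If $y$ is the true fraction of `yes' answers and $R$ out of $N$ received responses are `yes', then
\[
\mathrm{E}\!\left[ \frac{R}{N} \right] = p\,y + (1-p)\,q
\qquad \Longrightarrow \qquad
\hat{y} = \frac{R/N - (1-p)\,q}{p} .
\]
For example, with $p = 0.6$, $q = 0.5$, $N = 1000$, and $R = 470$, the estimate is $\hat{y} = (0.47 - 0.2)/0.6 = 0.45$.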


% Hiding in the crowd: Privacy preservation on evolving streams through correlation tracking

\hypertarget{li2007hiding}{Li et al.}~\cite{li2007hiding} attempt to tackle the problem of privacy preservation in data streams by continuously tracking data correlations. First, the authors define utility and privacy. The utility of a perturbed data stream is the inverse of the \emph{discrepancy} between the original and perturbed measurements. The discrepancy is set as the normalized \emph{Frobenius} norm, i.e.,~a matrix norm defined as the square root of the sum of the absolute squares of its elements. Privacy is the discrepancy between the original and the reconstructed data stream (from the perturbed one), and is composed of the removed noise and the error introduced by the reconstruction. Then, correlations come into play. The data streams are continuously monitored for new tuples and trends to track correlations, and the system dynamically adds noise accordingly. More specifically, the \emph{Streaming Correlated Additive Noise} (SCAN) module is used to update the estimation of the local principal components of the original data and proportionally distribute noise along the components. Thereafter, the \emph{Streaming Correlation Online Reconstruction} (SCOR) module removes all the noise by utilizing the best linear reconstruction. Overall, the present technique offers robustness against inference attacks by adapting randomization according to data trends, but fails to quantify the overall privacy guarantee.
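
In symbols, and with notation assumed here for illustration ($A$ for the original stream matrix, $A^{*}$ for the perturbed one, $\tilde{A}$ for the reconstruction, and $\kappa > 0$ for a normalization factor that is left unspecified), these definitions read:
\[
\| A \|_F = \sqrt{ \sum_{i} \sum_{j} |a_{ij}|^2 }, \qquad
D(A, A^{*}) = \kappa \, \| A - A^{*} \|_F ,
\]
\[
\mathrm{utility} = \frac{1}{D(A, A^{*})}, \qquad
\mathrm{privacy} = D(A, \tilde{A}) .
\]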


% PeGaSus: Data-Adaptive Differentially Private Stream Processing

\hypertarget{chen2017pegasus}{Chen et al.}~\cite{chen2017pegasus} developed \emph{PeGaSus}, an algorithm for event-level differentially private stream processing that supports different categories of stream queries (counts, sliding window, event monitoring) over multiple stream resolutions. It consists of a \emph{perturber}, a \emph{grouper}, and a \emph{smoother} module. The perturber consumes the incoming data stream, adds noise using part of the available privacy budget $\varepsilon$ to each data item, and outputs a stream of noisy data. The data-adaptive grouper consumes the original stream and partitions the data into well-approximated regions, also using part of the available privacy budget. Finally, a query-specific smoother combines the independent information produced by the perturber and the grouper, and performs post-processing by calculating the final estimates of the perturber's values for each partition created by the grouper at each timestamp. The combination of the perturber and the grouper follows the sequential composition and post-processing properties of differential privacy; thus, the resulting algorithm satisfies $\varepsilon$-differential privacy with $\varepsilon = \varepsilon_p + \varepsilon_g$. Here, $\varepsilon_p$ is the privacy budget used by the perturber to add noise to the data and $\varepsilon_g$ the corresponding budget used by the grouper to interfere with the user-defined deviation threshold. Nonetheless, the algorithm does not take into account past and/or future releases, thus failing to capture any related privacy leakage.


% Quantifying Differential Privacy under Temporal Correlations

\hypertarget{cao2017quantifying}{Cao et al.}~\cite{cao2017quantifying} propose a method of computing the \emph{temporal privacy leakage} of a differential privacy mechanism in the presence of temporal correlations and background knowledge. The goal of this work is to achieve event-level privacy protection and bound privacy leakage at every single time point. The temporal privacy leakage is calculated as the sum of the \emph{backward} and \emph{forward privacy leakage} minus the privacy leakage of the mechanism, because the latter is counted twice in the aforementioned quantities. The backward privacy leakage at any time depends on the backward privacy leakage at the previous time point, the temporal correlations, and the traditional privacy leakage of the privacy mechanism. The forward privacy leakage is calculated recursively, i.e.,~for every new time point all the previous time points are re-calculated, therefore increasing the privacy loss in the past. Intuitively, stronger correlations result in higher privacy leakage. However, the leakage is smaller when the dimension of the transition matrix (modeling the correlations) is larger, because larger transition matrices tend to be uniform, resulting in weaker correlations.
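
The relations described above can be sketched as follows, where $\varepsilon_t$ is the leakage of the mechanism at time $t$, and $\mathcal{L}^B$ and $\mathcal{L}^F$ stand for correlation-dependent increment functions whose exact form is not reproduced here (the notation is assumed for illustration):
\[
\mathrm{TPL}_t = \mathrm{BPL}_t + \mathrm{FPL}_t - \varepsilon_t ,
\]
\[
\mathrm{BPL}_t = \mathcal{L}^B(\mathrm{BPL}_{t-1}) + \varepsilon_t , \qquad
\mathrm{FPL}_t = \mathcal{L}^F(\mathrm{FPL}_{t+1}) + \varepsilon_t .
\]
Under this reading, if $\mathcal{L}^B(x) = 0$ nothing accumulates and $\mathrm{BPL}_t = \varepsilon_t$, whereas if $\mathcal{L}^B(x) = x$ the backward leakage grows as $\sum_{k \leq t} \varepsilon_k$, which coincides with sequential composition.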


% Differentially private event sequences over infinite streams

\hypertarget{kellaris2014differentially}{Kellaris et al.}~\cite{kellaris2014differentially} defined $w$-event privacy in the setting of periodical release of statistics (counts) in infinite streams. To achieve $w$-event privacy, the authors propose two mechanisms based on sliding windows, which effectively distribute the privacy budget to sub-mechanisms (one sub-mechanism per timestamp) applied on the data of a window of the stream. Both algorithms may decide whether or not to publish a new noisy count for a specific timestamp, based on the similarity level of the current count with a previously published one. Moreover, both algorithms have the constraint that the total privacy budget consumed in a window is less than or equal to $\varepsilon$. The first algorithm (Budget Distribution, BD) distributes the privacy budget in an exponentially fading manner, following the assumption that in a window most of the counts remain similar. The budget of expired timestamps becomes available for the next publications (of next windows). In contrast, the second algorithm (Budget Absorption, BA) uniformly distributes the budget to the window's timestamps from the beginning. A publication uses not only the budget allocated by default but also the budget of non-published timestamps. In order not to exceed the limit of $\varepsilon$, an adequate number of subsequent timestamps are `silenced'.
%Both algorithms are applicable to real life scenarios including traffic and website visit data. 
Even though one can argue that $w$-event privacy could be achieved by user-level privacy, it is nevertheless impractical because of the rigidity of the budget allocation, which would ultimately render the output useless.
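
The window constraint shared by both algorithms can be stated compactly. Writing $\varepsilon_k$ for the budget spent by the sub-mechanism at timestamp $k$ (notation assumed here), it requires that, for every timestamp $i$,
\[
\sum_{k=\max(1,\, i-w+1)}^{i} \varepsilon_k \leq \varepsilon .
\]
As a simple point of comparison, and not one of the two proposed algorithms, the uniform allocation $\varepsilon_k = \varepsilon / w$ meets the constraint with equality: for $\varepsilon = 1$ and $w = 10$, each timestamp receives $0.1$, so a count of sensitivity $1$ is perturbed with Laplace noise of scale $10$.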


% RescueDP: Real-time spatio-temporal crowd-sourced data publishing with differential privacy

\hypertarget{wang2016rescuedp}{Wang et al.}~\cite{wang2016rescuedp} work on the publication of real-time spatiotemporal user-generated data, utilizing differential privacy with a $w$-event guarantee. Initially, \emph{RescueDP} performs dynamic \emph{grouping} of regions with small statistics according to the data trends. Then, each group passes through a \emph{perturbation} module that injects Laplace noise. Due to the grouping of the previous phase, the perturbation error on small statistics can be eliminated, increasing the utility of the resulting statistics. A \emph{budget allocation} module distributes the available privacy budget to sampling points within any $w$ successive timestamps, using an adaptive \emph{sampling} module that adjusts according to data dynamics. Non-sampled data are approximated with previously perturbed data, saving part of the available privacy budget. Finally, a \emph{Kalman filtering} module is used to improve the accuracy of the published data.


\subsection{Sequential data}

% Practical differential privacy via grouping and smoothing

\hypertarget{kellaris2013practical}{Kellaris et al.}~\cite{kellaris2013practical} pointed out that in time series, where users might contribute to an arbitrary number of aggregates, the sensitivity of the query answering function is significantly influenced by their presence/absence in the data set. Thus, the \emph{Laplace perturbation algorithm}, commonly used with differential privacy, may produce meaningless data sets. Furthermore, under such settings, the discrete Fourier transformation of the \emph{Fourier perturbation algorithm} may behave erratically and affect the utility of the outcome of the mechanism. Hence, the authors proposed a method involving \emph{grouping} and \emph{smoothing} for one-time publishing of time series of \emph{non-overlapping} counts, i.e.,~each individual contributes to one count at a time. Grouping consists of separating the data set into similar clusters. The size and the similarity of the clusters are data dependent. Random grouping consumes less privacy budget, as there is minimum interaction with the original data. However, when using a grouping technique based on sampling, which has some privacy cost but produces better groups, the smoothing perturbation is decreased. During the smoothing phase, the average values for each cluster are calculated and, finally, Laplace noise is added. This way, the query sensitivity becomes less dependent on each individual's data and, therefore, less perturbation is required.


% Differentially private sequential data publication via variable-length n-grams

\hypertarget{chen2012differentially}{Chen et al.}~\cite{chen2012differentially} exploit a text-processing technique, the \emph{n-gram} model, i.e.,~a contiguous sequence of $n$ items from a given data sample, to retain information of a sequential data set without releasing the noisy counts of all possible sequences. This model allows publishing the most common $n$-grams ($n$ is typically smaller than 5) to accurately reconstruct the original data set. Privacy is enhanced by the fact that the universe of all grams with a shorter $n$ value is relatively small, resulting in more common sequences. Furthermore, utility is improved by the fact that, for small values of $n$, the corresponding counts are large enough to deal with noise injection and the inherent Markov assumption in the $n$-gram model. Variable-length $n$-grams are released with certain thresholds for the values of counts and tree heights, which addresses the trade-off of shorter grams carrying less information than longer ones, but having less relative error. Grams are grouped based on the similarity of their $n$ values, constructing a search tree. The process goes on until reaching the desired maximum $n$ value. Grams with smaller noisy counts have larger relative error and thus lower utility. Instead of allocating the available privacy budget based on the overall maximum height of the tree, each path is adaptively estimated based on known noisy counts. To further improve the final utility, consistency constraints are used, i.e.,~the sum of children's noisy counts has to be less than or equal to their parent's noisy count, and noisy counts of leaf nodes should be within a set threshold. The technique is designed for count query and frequent sequential pattern mining scenarios.
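
Two of the ingredients above can be made explicit; the notation is assumed here for illustration. The Markov assumption of the $n$-gram model states that the next item of a sequence $x_1 x_2 \ldots x_i$ depends only on the preceding $n-1$ items,
\[
P(x_{i+1} \mid x_1, \ldots, x_i) \approx P(x_{i+1} \mid x_{i-n+2}, \ldots, x_i) ,
\]
and the first consistency constraint requires that, for every node $v$ of the search tree with noisy count $\tilde{c}(v)$,
\[
\sum_{u \in \mathrm{children}(v)} \tilde{c}(u) \leq \tilde{c}(v) .
\]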


% Differentially private publication of general time-serial trajectory data

\hypertarget{hua2015differentially}{Hua et al.}~\cite{hua2015differentially} tackle the problem of trajectories containing a small number of $n$-grams, thus sharing few or even no identical prefixes. They propose a differentially private location generalization algorithm (exponential mechanism) for trajectory publishing, where each position in the trajectory is one record. The algorithm probabilistically partitions the locations at each timestamp with regard to their Euclidean distance from each other. Each partition is replaced by its centroid and, therefore, locations belonging to closer trajectories are grouped together, resulting in better utility. The algorithm is optimized for time efficiency by using classic $k$-means clustering. Then, the algorithm releases the new trajectories over the generalized location partitions, and their perturbed counts with noise drawn from a Laplace distribution. The process continues until the total count of the published trajectories reaches the size of the original data set. If the user's moving speed is taken into account, the total number of possible trajectories can be limited. The authors evaluated the utility of distorted spatiotemporal range queries using the Hausdorff distance from the original results, and concluded that the utility deterioration is within reasonable boundaries considering the offered privacy guarantees.
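
For reference, the two standard building blocks mentioned above are as follows. The utility function $u$, its sensitivity $\Delta u$, and the point sets $A$ and $B$ are generic placeholders here, not the authors' specific choices. The exponential mechanism selects an output $r$ for a data set $D$ with probability
\[
\Pr[\mathcal{M}(D) = r] \propto \exp\!\left( \frac{\varepsilon \, u(D, r)}{2 \Delta u} \right) ,
\]
and the Hausdorff distance between two point sets under a base distance $d$ is
\[
d_H(A, B) = \max \left\{ \sup_{a \in A} \inf_{b \in B} d(a, b), \; \sup_{b \in B} \inf_{a \in A} d(a, b) \right\} .
\]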


% Achieving differential privacy of trajectory data publishing in participatory sensing

\hypertarget{li2017achieving}{Li et al.}~\cite{li2017achieving} focus on publishing a set of trajectories where, contrary to~\cite{hua2015differentially}, each one is considered as a single entry in the data set. First, the original locations are partitioned by using $k$-means clustering based on their pairwise Euclidean distances. Each location partition is represented by its mean (centroid). A larger number of partitions translates into fewer locations in each partition and, thus, smaller trajectory precision loss. Before adding noise to the trajectory number, the original size of the database is approximated by randomly observing the generalized trajectories with the original ones. Then, by using a set of consistency constraints, bounded Laplace noise is generated and added to the number of each trajectory. Finally, the generalized trajectories as well as their noisy counts are released. Although this technique considerably reduces the trajectory merging time, the assumption that all trajectories in the data set are recorded at the same time points does not usually apply in real-life use cases.


\subsection{Time series}

% Privacy-utility trade-off under continual observation

\hypertarget{erdogdu2015privacy}{Erdogdu et al.}~\cite{erdogdu2015privacy} consider the scenario where users generate samples at every timestamp from a time series correlated with their sensitive data. Data that the users have chosen and are willing to privately share with a service provider are distorted according to a \emph{privacy mapping}, i.e.,~a stochastic process, and then samples are selected for release. A \emph{distortion metric} quantifies the discrepancy of the distorted data from the original. The authors investigate both a simple attack setting, where the adversary can make static assumptions based only on the observations so far that cannot be altered later, and a more complex one, where assumptions are affected dynamically by past and future data releases. In both cases, information leakage at a time point is quantified by a \emph{privacy metric} that measures the improvement of the adversarial inference after observing the data released at that particular point. The goal of the privacy mapping is to find a balance between the distortion and privacy metrics, i.e.,~achieving maximum released data utility while preserving privacy. Throughout the process, both batch and streaming processing schemas are considered. In order to decrease the complexity of streaming processing, the authors propose the utilization of HMMs for data dependency modeling. The assumption that users are privacy-conscious and the fact that typical smart-meter system data include only the total power usage can drastically limit the applicability of the technique described. Last but not least, there is no proof that the proposed technique is composable.


% Bayesian Differential Privacy on Correlated Data

\hypertarget{yang2015bayesian}{Yang et al.}~\cite{yang2015bayesian} show that privacy is poorer against an adversary who has the least prior knowledge. Correlations may sometimes be negative and thus, the weakest adversary may not correspond to the largest privacy leakage. When data are correlated, according to a Gaussian correlation model, the adversary with the least prior knowledge poses the highest risk of information leakage. This is because the expected variation of the query results is enhanced by the unknown tuples and the correlations with respect to different values of the private individual. The adversaries might have different correlation structures since they could collect information from different sources. Therefore, it is necessary to consider the privacy of correlated data and arbitrary adversaries. To address this necessity, the authors extend the definition of differential privacy in a Bayesian way, and propose a new \emph{Pufferfish} privacy definition, called \emph{Bayesian differential privacy}, to express the level of private information leakage. Additionally, they designed a general perturbation algorithm that guarantees privacy, taking into account prior knowledge of any subset of tuples in the data, when the data are correlated. Data correlations are transformed into a weighted network with an arbitrary topology structure, where the correlation strength is translated into a weight value. The larger the value of the weight, the more likely it is for two tuples to be close and, thus, correlated. These networks are described by a Gaussian Markov random field. A Gaussian correlation model is used to accurately describe the structure of data correlations and analyze the Bayesian differential privacy of the perturbation algorithm on the basis of this model. This model is extended to a more general one by adding a prior distribution to each tuple, so that it forms a Gaussian joint distribution on all tuples.
The uncertain query answer is connected with the given tuples in a Bayesian way. The perturbation mechanism calculates the potential leakage for the strongest adversaries and applies noise proportional to the maximum privacy leakage coefficient. On the downside, the proposed solution is not suitable for applications that require online processing for real-time statistics.


% Pufferfish Privacy Mechanisms for Correlated Data

\hypertarget{song2017pufferfish}{Song et al.}~\cite{song2017pufferfish} propose the \emph{Wasserstein mechanism}, a technique that can apply to any general instantiation of \emph{Pufferfish}. It adds noise proportional to the \emph{sensitivity} of a query $F$, depending on the worst-case distance between the distributions $P(F(X)|s_i,d)$ and $P(F(X)|s_j,d)$ for a variable $X$, a pair of secrets $(s_i,s_j)$, and an evolution scenario $d$. The worst-case distance between those two distributions is calculated by the \emph{Wasserstein metric} function. The noise is drawn from a Laplace distribution with parameter equal to the maximum Wasserstein distance between the distributions over all pairs of secrets, divided by the available privacy budget $\varepsilon$. For optimization purposes, the authors consider a more restricted setting, where data correlations, represented by the evolution scenario $d$, are modeled by using \emph{Bayesian networks}. Dependencies are calculated by the \emph{Markov quilt mechanism}, a generalization of the \emph{Markov blanket mechanism}, where the dependent nodes of any node consist of its parents, its children, and the other parents of its children. The present technique excels at data sets generated by monitoring applications or networks; however, it fails to apply in online settings.
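
In symbols, the noise calibration described above can be sketched as follows. Here $W_{\infty}$ denotes the $\infty$-Wasserstein distance, which we take to be the metric meant above, and the supremum ranges over all admissible pairs of secrets and evolution scenarios; $W$ and $Z$ are notation introduced for illustration.
\[
W = \sup_{(s_i, s_j),\, d} W_{\infty}\!\left( P(F(X) \mid s_i, d), \; P(F(X) \mid s_j, d) \right), \qquad
\mathcal{M}(X) = F(X) + Z, \quad Z \sim \mathrm{Lap}(W / \varepsilon) .
\]
When the secrets differ in a single record and records are independent, $W$ is at most the usual global sensitivity of $F$, so the noise never exceeds that of the standard Laplace mechanism in that case.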


% Differentially private multi-dimensional time series release for traffic monitoring

\hypertarget{fan2013differentially}{Fan et al.}~\cite{fan2013differentially} propose a real-time framework for releasing differentially private multi-dimensional traffic monitoring data. Data at every timestamp are injected with noise, drawn from a Laplace distribution, by the \emph{Perturbation} module. The perturbed data are post-processed by the \emph{Estimation} module to produce a more accurate released version. Domain knowledge, e.g.,~road network and density, is utilized by the \emph{Modeling/Aggregation} module in two ways. On the one hand, an internal time series model is estimated for each location to improve the utility of the perturbation's outcome by performing a posterior estimation that utilizes \emph{Gaussian} approximation and \emph{Kalman} filtering. On the other hand, data sparsity is reduced by grouping neighboring locations based on a \emph{Quadtree}. All modules have a bidirectional interaction between them. Although data correlations between timestamps are taken into account to improve the released data utility, the corresponding privacy leakage is not calculated. Furthermore, the adoption of sampling during the data processing could further improve the budget allocation procedure.


% CTS-DP: publishing correlated time-series data via differential privacy

\hypertarget{wang2017cts}{Wang et al.}~\cite{wang2017cts} defined \emph{CTS-DP}, a correlated time-series data publication method based on differential privacy that enforces \emph{Series-Indistinguishability} and implements a \emph{correlated Laplace mechanism (CLM)}. \emph{CTS-DP} deals with the shortcomings of \emph{independent and identically distributed (IID) noise}. In the presence of correlations, IID noise offers inadequate protection since one can remove it by applying refinement methods, e.g.,~filtering. Therefore, more noise must be introduced to make up for the amount of noise that can be removed, thus diminishing data utility. First, \emph{Series-Indistinguishability} is defined, which renders the statistical characteristics of the original and noise series indistinguishable. After Series-Indistinguishability is defined, the autocorrelation function of the noise series is derived. Second, a CLM uses four Gaussian white noise series passed through a linear system to produce a correlated Laplace noise series according to their autocorrelation function. However, the privacy leakage stemming from data correlations is not estimated.
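
To see why four Gaussian series suffice to produce Laplace noise, recall one standard identity; whether the CLM uses exactly this form is an assumption made here for illustration. If $X_1, X_2, X_3, X_4$ are independent $\mathcal{N}(0, \sigma^2)$ variables, then $X_1^2 + X_2^2$ and $X_3^2 + X_4^2$ are independent exponential variables with mean $2\sigma^2$, and their difference is Laplace distributed:
\[
Y = X_1^2 + X_2^2 - X_3^2 - X_4^2 \sim \mathrm{Lap}(2\sigma^2) .
\]
Hence a target Laplace scale $\lambda$ corresponds to $\sigma^2 = \lambda / 2$. Feeding correlated, rather than white, Gaussian series into this construction is what would carry a prescribed autocorrelation over to the resulting noise series.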


% An Adaptive Approach to Real-Time Aggregate Monitoring With Differential Privacy

\hypertarget{fan2014adaptive}{Fan et al.} propose FAST~\cite{fan2014adaptive}, an adaptive system that allows the release of real-time aggregate time series under user-level differential privacy. This is achieved by using a \emph{sampling}, a \emph{perturbation}, and a \emph{filtering} module. The sampling module samples, at an adaptive rate, the aggregates to be perturbed. The perturbation module adds noise to each sampled point according to the allocated privacy budget. The filtering module receives the perturbed point and the original one, and generates a posterior estimate, which is finally released. The error between the perturbed and the released (posterior estimate) point is used to adapt the sampling rate; the sampling frequency is increased when data are going through rapid changes and vice versa. Thus, depending on the adjusted sampling rate, not every single data point is perturbed, which saves part of the available privacy budget. Although temporal correlations of the processed time series are considered, the corresponding privacy leakage is not calculated.
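
A posterior estimate of this kind is commonly obtained with a scalar Kalman filter. The following is a minimal sketch under assumptions made here for illustration: a constant process model $x_k = x_{k-1} + \omega_k$ with $\omega_k \sim \mathcal{N}(0, Q)$, and a noisy observation $z_k = x_k + \nu_k$ whose Laplace noise $\nu_k$ is approximated by $\mathcal{N}(0, R)$. With $\hat{x}_k$ the posterior estimate and $P_k$ its error variance, the prediction and correction steps are
\[
\begin{array}{rcl}
\hat{x}_k^{-} & = & \hat{x}_{k-1}, \qquad P_k^{-} = P_{k-1} + Q , \\[4pt]
K_k & = & \displaystyle \frac{P_k^{-}}{P_k^{-} + R} , \\[8pt]
\hat{x}_k & = & \hat{x}_k^{-} + K_k \left( z_k - \hat{x}_k^{-} \right), \qquad P_k = (1 - K_k)\, P_k^{-} .
\end{array}
\]
At non-sampled timestamps only the prediction step applies, so the released value is the prior $\hat{x}_k^{-}$ and no privacy budget is consumed.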