\chapter{Preliminaries}
\label{ch:prel}
In this chapter, we introduce the relevant terminology and background for the problem of publishing privacy-sensitive, user-generated big data, focusing on the setting of continuous publishing.
Such data may be organized in various forms, e.g.,~as tuples in relational tables, as key-value pairs, or as graphs.
First, in Section~\ref{sec:data}, we categorize user-generated data sets and review data processing in the context of continuous data publishing.
Second, in Section~\ref{sec:privacy}, we define information disclosure in data privacy. Thereafter, we list the categories of privacy attacks, %identified in the literature,
\label{subsec:prv-info-dscl}
When personal data are publicly released, either as microdata or statistical data, individuals' privacy can be compromised, i.e.,~an adversary becomes certain about an individual's personal information with a probability higher than a desired threshold.
In the literature, this risk is known as \emph{information disclosure} and is usually categorized as follows~\cite{li2007t, wang2010privacy, narayanan2008robust}:
\begin{itemize}
\item \emph{Presence disclosure}---the participation or absence of an individual in a data set is revealed.
\item \emph{Identity disclosure}---an individual is linked to a particular record.
\item \emph{Attribute disclosure}---new information (attribute value) about an individual is revealed.
\end{itemize}
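To make these categories concrete, here is a minimal sketch (with hypothetical records and background knowledge, not taken from the thesis's running example) of how a single linking step can trigger all three disclosure types:

```python
# Toy illustration of the three disclosure types on a hypothetical
# released table (names removed, quasi-identifiers kept).
released = [
    {"zip": "75018", "birth_year": 1985, "status": "at home"},
    {"zip": "75009", "birth_year": 1990, "status": "dining"},
]

# Background knowledge: the adversary knows a neighbor's zip code and
# birth year, and suspects the table covers the neighborhood.
background = {"zip": "75018", "birth_year": 1985}

matches = [r for r in released
           if r["zip"] == background["zip"]
           and r["birth_year"] == background["birth_year"]]

if len(matches) == 1:
    # Presence disclosure: the neighbor participates in the data set.
    # Identity disclosure: the record is linked to the neighbor.
    # Attribute disclosure: the neighbor's sensitive value is revealed.
    print(matches[0]["status"])  # prints: at home
```

The sketch shows why removing direct identifiers alone is insufficient: the remaining attributes can still act as a key for linking.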
Information disclosure is typically achieved by combining supplementary (background) knowledge with the released data, or by exploiting unrealistic assumptions made in the design of the privacy-preserving algorithms.
In its general form, this is known as \emph{adversarial} or \emph{linkage} attack.
Even though many works directly refer to the general category of linkage attacks, we also distinguish the following sub-categories:
\begin{itemize}
\item \emph{Sensitive attribute domain knowledge}, i.e.,~knowledge about the domain of the sensitive attribute (the attribute whose value should not be linkable to an individual, e.g.,~the Status attribute in Table~\ref{tab:snapshot-micro}), can result in \emph{homogeneity} and \emph{skewness} attacks~\cite{machanavajjhala2006diversity,li2007t}, when statistics of the sensitive attribute values are available, and in the \emph{similarity} attack, when semantics of the sensitive attribute values are available.
\item \emph{Complementary release attacks}~\cite{sweeney2002k} exploit previous releases of different versions of the same and/or related data sets.
In this category, we also identify the \emph{unsorted matching} attack~\cite{sweeney2002k}, which is achieved when two privacy-protected versions of an original data set are published in the same tuple ordering.
Other instances include: (i)~the \emph{join} attack~\cite{wang2006anonymizing}, when tuples can be identified by joining several releases on their (quasi-)identifiers, i.e.,~attributes that, in combination, can single out an individual (e.g.,~zip code, birth date, and gender); (ii)~the \emph{tuple correspondence} attack~\cite{fung2008anonymity}, when, in the case of incremental data, certain tuples correspond to certain tuples in other releases in an injective way; (iii)~the \emph{tuple equivalence} attack~\cite{he2011preventing}, when tuples among different releases are found to be equivalent with respect to the sensitive attribute; and (iv)~the \emph{unknown releases} attack~\cite{shmueli2015privacy}, when privacy preservation must be performed without knowledge of the previously published privacy-protected versions of the data set, i.e.,~a release must remain protected even when combined with other releases that the publisher is unaware of.
\item \emph{Data dependence} attacks exploit dependencies either within one data set, or between one data set and previous data releases and/or other external sources~\cite{kifer2011no, chen2014correlated, liu2016dependence, zhao2017dependent}.
We will look into this category in more detail later in Section~\ref{sec:correlation}.
\end{itemize}
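The homogeneity attack can be made concrete with a short sketch (hypothetical anonymized groups): when every record in a group of indistinguishable quasi-identifiers shares the same sensitive value, the group discloses that value for all of its members:

```python
# Minimal homogeneity-attack check on hypothetical anonymized groups.
# Within each group the quasi-identifiers are indistinguishable, so if
# the sensitive attribute ("status") has only one value in a group,
# every member's sensitive value is disclosed.
groups = {
    "zip=750**, age=20-30": ["running", "running", "running"],
    "zip=751**, age=30-40": ["at home", "at work", "dining"],
}

vulnerable = {gid for gid, statuses in groups.items()
              if len(set(statuses)) == 1}
print(vulnerable)  # only the first group leaks its members' status
```

This is essentially the check that diversity-based models formalize: each group should contain sufficiently many distinct (and well-represented) sensitive values.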
The first sub-category of attacks has been mainly addressed in works on snapshot microdata publishing, but is also present in continuous publishing; however, algorithms for continuous publishing typically accept the proposed solutions for the snapshot publishing scheme (see discussion over $k$-anonymity and $l$-diversity in Section~\ref{subsec:prv-seminal}).
This kind of attack is tightly coupled with publishing the (privacy-protected) sensitive attribute value.
An example is the lack of diversity in the sensitive attribute domain, e.g.,~if all users in the data set of Table~\ref{tab:snapshot-micro} had \emph{running} as their Status (the sensitive attribute).
The second and third sub-categories are attacks emerging (mostly) in continuous publishing scenarios.
Consider again the data set in Table~\ref{tab:snapshot-micro}.
The complementary release attack means that an adversary can learn more about the individuals (e.g.,~that there is a high chance that Donald was at work) by combining the information of two privacy-protected versions of this data set.
With the data dependence attack, the status of Donald can be inferred with higher certainty by taking into account the status of Dewey at the same moment and the dependencies between Donald's and Dewey's statuses, e.g.,~when Dewey is at home, then most probably Donald is at work.
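The gain in the adversary's certainty can be sketched with hypothetical numbers: assume the adversary has estimated a joint distribution of Donald's and Dewey's statuses from auxiliary data; conditioning on Dewey's observed status sharpens the belief about Donald:

```python
# Hypothetical joint distribution P(Donald, Dewey) estimated by the
# adversary from auxiliary data; the numbers are illustrative only.
joint = {
    ("at work", "at home"): 0.40,
    ("at home", "at home"): 0.05,
    ("at work", "at work"): 0.10,
    ("at home", "at work"): 0.45,
}

# Prior belief about Donald, before observing Dewey.
p_donald_work = sum(p for (d, _), p in joint.items() if d == "at work")

# Posterior after observing Dewey at home (data dependence attack).
p_dewey_home = sum(p for (_, w), p in joint.items() if w == "at home")
p_work_given_home = joint[("at work", "at home")] / p_dewey_home

print(p_donald_work)      # 0.50: a priori, a coin flip
print(p_work_given_home)  # ~0.89: far more certain after conditioning
```

The adversary's certainty jumps from 50\% to roughly 89\%, which is exactly the kind of boost that correlation-aware privacy models try to bound.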
\subsection{Levels of privacy protection}
\label{subsec:prv-levels}
The information disclosure that a data release may entail depends on the \emph{protection level} that a privacy-preserving algorithm targets, i.e.,~on \emph{what} the algorithm is trying to protect.
More specifically, in continuous data publishing we consider the privacy protection level with respect to not only the users, but also to the \emph{events} occurring in the data.
An event is a pair consisting of an identifying attribute of an individual and the sensitive data (including contextual information); we can see it as corresponding to a record in a database in which each individual may participate only once.
Data publishers typically release events in the form of sequences of data items, usually indexed in time order (time series) and geotagged, e.g.,~(`Dewey', `at home at Montmartre at $t_1$'), \dots, (`Quackmore', `dining at Opera at $t_1$').
We use the term `users' to refer to the \emph{individuals}, also known as \emph{participants}, who are the source of the processed and published data.
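Continuing the example above, such an event stream can be sketched as time-indexed tuples (a hypothetical representation, merely for illustration):

```python
from collections import namedtuple

# An event pairs an identifying attribute with sensitive data,
# including contextual information (location, timestamp).
Event = namedtuple("Event", ["user", "status", "location", "t"])

stream = [
    Event("Dewey", "at home", "Montmartre", 1),
    Event("Quackmore", "dining", "Opera", 1),
]

# User-level protection would hide whether "Dewey" contributed at all;
# event-level protection hides each individual pair separately.
dewey_events = [e for e in stream if e.user == "Dewey"]
print(len(dewey_events))  # -> 1
```

The distinction in the comment is the one this subsection develops: what a privacy-preserving algorithm protects can be an entire user's contribution or a single event.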