data: Reviewed data categories

Manos Katsomallos 2021-10-21 22:59:18 +02:00
parent 4f02c649dc
commit 6240f457e7
3 changed files with 88 additions and 59 deletions


@ -1,30 +1,30 @@
\begin{table} \begin{figure}
\centering \centering
\subcaptionbox{Microdata\label{tab:continuous-micro}}{% \subcaptionbox{Microdata\label{tab:continuous-micro}}{%
\adjustbox{max width=\linewidth}{% \adjustbox{max width=\linewidth}{%
\begin{tabular}{@{}ccc@{}} \begin{tabular}{@{}ccc@{}}
\begin{tabular}{@{}lrll@{}} \begin{tabular}{@{}lrll@{}}
\toprule \toprule
\textit{Name} & \multicolumn{1}{c}{Age} & Location & Status \\ \emph{Name} & \multicolumn{1}{c}{Age} & Location & Status \\
\midrule \midrule
Donald & $27$ & Le Marais & at work \\ Donald & $27$ & Le Marais & at work \\
Daisy & $25$ & Belleville & driving \\ Daisy & $25$ & Belleville & driving \\
Huey & $12$ & Montmartre & running \\ Huey & $12$ & Montmartre & running \\
Dewey & $11$ & Montmartre & at home \\ Dewey & $11$ & Montmartre & at home \\
Louie & $10$ & Latin Quarter & walking \\ Louie & $10$ & Quartier Latin & walking \\
Quackmore & $62$ & Opera & dining \\ Quackmore & $62$ & Opéra & dining \\
\bottomrule \bottomrule
\multicolumn{4}{c}{$t_1$} \\ \multicolumn{4}{c}{$t_1$} \\
\end{tabular} & \end{tabular} &
\begin{tabular}{@{}lrll@{}} \begin{tabular}{@{}lrll@{}}
\toprule \toprule
\textit{Name} & \multicolumn{1}{c}{Age} & Location & Status \\ \emph{Name} & \multicolumn{1}{c}{Age} & Location & Status \\
\midrule \midrule
Donald & $27$ & Montmartre & driving \\ Donald & $27$ & Montmartre & driving \\
Daisy & $25$ & Montmartre & at the mall \\ Daisy & $25$ & Montmartre & at the mall \\
Huey & $12$ & Latin Quarter & sightseeing \\ Huey & $12$ & Quartier Latin & sightseeing \\
Dewey & $11$ & Opera & walking \\ Dewey & $11$ & Opéra & walking \\
Louie & $10$ & Latin Quarter & at home \\ Louie & $10$ & Quartier Latin & at home \\
Quackmore & $62$ & Montmartre & biking \\ Quackmore & $62$ & Montmartre & biking \\
\bottomrule \bottomrule
\multicolumn{4}{c}{$t_2$} \\ \multicolumn{4}{c}{$t_2$} \\
@ -40,13 +40,18 @@
& \multicolumn{1}{c}{$t_1$} & \multicolumn{1}{c}{$t_2$} & \dots \\ & \multicolumn{1}{c}{$t_1$} & \multicolumn{1}{c}{$t_2$} & \dots \\
\midrule \midrule
Belleville & $1$ & $0$ & \dots \\ Belleville & $1$ & $0$ & \dots \\
Latin Quarter & $1$ & $2$ & \dots \\ Quartier Latin & $1$ & $2$ & \dots \\
Le Marais & $1$ & $0$ & \dots \\ Le Marais & $1$ & $0$ & \dots \\
Montmartre & $2$ & $3$ & \dots \\ Montmartre & $2$ & $3$ & \dots \\
Opera & $1$ & $1$ & \dots \\ Opéra & $1$ & $1$ & \dots \\
\bottomrule \bottomrule
\end{tabular}% \end{tabular}%
}% }%
\caption{Continuous data observation \kat{continuous data observation sounds like an action.. better say directly microdata and statistics gathered in consequent timestamps?} of (a)~microdata, and (b)~corresponding statistics at multiple timestamps.} \caption{
\label{tab:continuous} % Continuous data observation
\end{table} % \kat{continuous data observation sounds like an action.. better say directly microdata and statistics gathered in consequent timestamps?}
% of
(a)~Microdata, and (b)~the corresponding statistics at multiple timestamps.
}
\label{fig:continuous}
\end{figure}


@ -1,16 +1,16 @@
\begin{table} \begin{figure}
\centering\hspace{\fill} \centering\hspace{\fill}
\subcaptionbox{Microdata\label{tab:snapshot-micro}}{% \subcaptionbox{Microdata\label{tab:snapshot-micro}}{%
\begin{tabular}{@{}lrll@{}} \begin{tabular}{@{}lrll@{}}
\toprule \toprule
\textit{Name} & \multicolumn{1}{c}{Age} & Location & Status \\ \emph{Name} & \multicolumn{1}{c}{Age} & Location & Status \\
\midrule \midrule
Donald & $27$ & Le Marais & at work \\ Donald & $27$ & Le Marais & at work \\
Daisy & $25$ & Belleville & driving \\ Daisy & $25$ & Belleville & driving \\
Huey & $12$ & Montmartre & running \\ Huey & $12$ & Montmartre & running \\
Dewey & $11$ & Montmartre & at home \\ Dewey & $11$ & Montmartre & at home \\
Louie & $10$ & Latin Quarter & walking \\ Louie & $10$ & Quartier Latin & walking \\
Quackmore & $62$ & Opera & dining \\ Quackmore & $62$ & Opéra & dining \\
\bottomrule \bottomrule
\end{tabular}% \end{tabular}%
}\hspace{\fill} }\hspace{\fill}
@ -20,14 +20,14 @@
Location & \multicolumn{1}{c@{}}{Count} \\ Location & \multicolumn{1}{c@{}}{Count} \\
\midrule \midrule
Belleville & $1$ \\ Belleville & $1$ \\
Latin Quarter & $1$ \\ Quartier Latin & $1$ \\
Le Marais & $1$ \\ Le Marais & $1$ \\
Montmartre & $2$ \\ Montmartre & $2$ \\
Opera & $1$ \\ Opéra & $1$ \\
\bottomrule \bottomrule
\\ \\
\end{tabular}% \end{tabular}%
}\hspace{\fill} }\hspace{\fill}
\caption{Example of raw user-generated (a)~microdata, and related (b)~statistical data for a specific timestamp.} \caption{Example of raw user-generated (a)~microdata, and the related (b)~statistical data for a specific timestamp.}
\label{tab:snapshot} \label{fig:snapshot}
\end{table} \end{figure}


@ -1,59 +1,83 @@
\section{Data sets and data publishing} \section{Data sets and data publishing}
\label{sec:data} \label{sec:data}
In this section, we categorize user-generated data sets in terms of their form and their processing and publishing.
\subsection{Data categories} \subsection{Data categories}
\label{subsec:data-categories} \label{subsec:data-categories}
\kat{Again, the title of the thesis is user-generated data, so there should exist also a distinction between user-generated and third party generated data. Hospital data for example, would fall in the third party generated data.} % \kat{Again, the title of the thesis is user-generated data, so there should exist also a distinction between user-generated and third party generated data. Hospital data for example, would fall in the third party generated data.}
In this thesis, we are interested in data that contain information about individuals and their actions, as these are highly privacy-sensitive. In this thesis, we are interested in data that contain information about individuals and their actions, as these are highly privacy-sensitive.
We firstly classify data based on their content \kat{'based on their content' reminds me of health data, trajectories, etc., not if they are aggregated or not. }: A typical category of such data is \emph{user-generated data}, which are the outcome of user--service interactions, e.g.,~social media or location-based services (LBS).
These interactions result in the generation of \emph{data items}, which are tuples that typically contain a user identifier, a timestamp, and context information (e.g.,~location or activity).
We first classify data based on their
% content \kat{'based on their content' reminds me of health data, trajectories, etc., not if they are aggregated or not. }:
form into \emph{microdata} and \emph{statistical data}.
\kat{Use full sentences, even in the bullets. } % \kat{Use full sentences, even in the bullets. }
% \mk{OK}
\begin{itemize} \begin{itemize}
\item \emph{Microdata}---the data items \kat{define data item} in their raw, usually tabular, form pertaining to individuals or objects \kat{objects?}. \item \emph{Microdata} are the data items \kat{define data item} in their raw, usually tabular, form pertaining to individuals.
\item \emph{Statistical data}---the outcome of statistical processes on microdata. % or objects \kat{objects?}.
\item \emph{Statistical data} are the outcome of statistical processes on microdata, e.g.,~averages, counts, or sums.
\end{itemize} \end{itemize}
To accompany and facilitate the descriptions in this chapter, we provide the following running example. To accompany and facilitate the descriptions in this chapter, we provide Example~\ref{ex:snapshot} as a running example.
\begin{example} \begin{example}
\label{ex:snapshot} \label{ex:snapshot}
Users interact with an LBS by making queries in order to retrieve some useful location-based information or just reporting user-state at various locations. Users interact with an LBS by making queries in order to retrieve some useful location-based information or just reporting user-state at various locations.
This user--LBS interaction generates user-related data, organized in a schema with the following attributes: \emph{Name} (the unique identifier of the table), \emph{Age}, \emph{Location}, and \emph{Status} (Table~\ref{tab:snapshot-micro}). This user--LBS interaction generates user-related data, organized in a schema with the following attributes: \emph{Name} (the unique identifier of the table), \emph{Age}, \emph{Location}, and \emph{Status} (Table~\ref{tab:snapshot-micro}).
The `Status' attribute includes information that characterizes the user's state or the query itself, and its value varies according to the service functionality. The `Status' attribute includes information that characterizes the user state or the query itself, and its value varies according to the service functionality.
Subsequently, the generated data are aggregated (by issuing count queries over them) in order to derive useful information about the popularity of the venues during the day (Table~\ref{tab:snapshot-statistical}). Subsequently, the generated data are aggregated (by issuing count queries over them) in order to derive useful information about the popularity of the venues during the day (Table~\ref{tab:snapshot-statistical}).
\includetable{preliminaries/snapshot} \includetable{preliminaries/snapshot}
\end{example} \end{example}
\kat{I miss the definition of data. You speak of data items, data values, what is the difference to data?} % \kat{I miss the definition of data. You speak of data items, data values, what is the difference to data?}
% \mk{Done above}
An example of microdata is displayed in Table~\ref{tab:snapshot-micro}, while an example of statistical data in Table~\ref{tab:snapshot-statistical}. An example of microdata is displayed in Table~\ref{tab:snapshot-micro}, while an example of statistical data in Table~\ref{tab:snapshot-statistical}.
Data, in either of these two forms, may have a special property called~\emph{continuity}, i.e.,~their values change and can be observed through time. \kat{The way that you define it here reminds temporal data. What is the difference?} Data, in either of these two forms, may have a special property called~\emph{continuity}, i.e.,~their values change and can be observed through time.
\kat{If you say that data may have a special property called continuity, we wonder about the existence of other properties. Be more explicit on why you choose to mention only this property.} % \kat{The way that you define it here reminds temporal data. What is the difference?}
Depending on the span of the observation, we distinguish the following categories: % \mk{It's the same, we talk about time in data, i.e., temporal data. No?}
% \kat{If you say that data may have a special property called continuity, we wonder about the existence of other properties. Be more explicit on why you choose to mention only this property.}
% \mk{OK}
Observing the evolution of the data attribute values over time may offer valuable insight into the underlying population, not only about the past but also about the present and the future.
Depending on the span of the observation, we categorize data into \emph{finite} and \emph{infinite}.
\begin{itemize} \begin{itemize}
\item \emph{Finite data}---data are observed during a predefined time interval. \item \emph{Finite data} are observed during a predefined time interval.
\item \emph{Infinite data}---data are observed in an uninterrupted fashion. \item \emph{Infinite data} are observed in an uninterrupted fashion.
\end{itemize} \end{itemize}
\begin{example} \begin{example}
\label{ex:continuous} \label{ex:continuous}
Extending Example~\ref{ex:snapshot}, \kat{Maybe put these three tables in a Figure instead of a table?} Table~\ref{tab:continuous} shows an example of continuous data observation \kat{maybe mention explicitly before what is data observation and continuous data observation }, by introducing one data table for each consecutive timestamp. Extending Example~\ref{ex:snapshot},
The two data tables over the time-span $[t_1, t_2]$ are an example of finite data. % \kat{Maybe put these three tables in a Figure instead of a table?}
% \mk{OK}
Figure~\ref{fig:continuous} shows an example of continuous data,
% observation
% \kat{maybe mention explicitly before what is data observation and continuous data observation }
% \mk{Did it above}
by introducing one data table for each consecutive timestamp.
The two data tables over the time span $[t_1, t_2]$ are an example of finite data.
Infinite data are the whole series of data obtained over the period~$[t_1, \infty)$ (infinity is denoted by `\dots'). Infinite data are the whole series of data obtained over the period~$[t_1, \infty)$ (infinity is denoted by `\dots').
\includetable{preliminaries/continuous} \includetable{preliminaries/continuous}
\end{example} \end{example}
\kat{Why isn't next the presentation of sequential and incremental in bullets, as for the categories before?} % \kat{Why isn't next the presentation of sequential and incremental in bullets, as for the categories before?}
% \mk{Fixed}
We further define two sub-categories applicable to both finite and infinite data: \emph{sequential} and \emph{incremental} data; these two subcategories are not exhaustive, i.e.,~not all data sets belong to the one or the other category. We further define two sub-categories applicable to both finite and infinite data: \emph{sequential} and \emph{incremental} data; these two subcategories are not exhaustive, i.e.,~not all data sets belong to the one or the other category.
In sequential data, the value of the observed variable changes, depending on its previous value.
\begin{itemize}
\item \emph{Sequential data} have variable values that change depending on their previous values.
For example, trajectories are finite sequences of location stamps, as naturally the position at each timestamp is connected to the position at the previous timestamp. For example, trajectories are finite sequences of location stamps, as naturally the position at each timestamp is connected to the position at the previous timestamp.
In incremental data, an original data set is augmented in each subsequent timestamp with supplementary information. \item \emph{Incremental data} are augmented at each subsequent timestamp with supplementary information.
For example, trajectories can be considered as incremental data, when at each timestamp we consider all the previously visited locations by an individual, incremented by his current position. For example, trajectories can be considered as incremental data when at each timestamp we consider all the previously visited locations by an individual incremented by the individual's current position.
\end{itemize}
\subsection{Data processing and publishing} \subsection{Data processing and publishing}