Doctoral thesis (Dissertations and theses)
TRANSFORMING DATA PREPROCESSING: A HOLISTIC, NORMALIZED AND DISTRIBUTED APPROACH
Tawakuli, Amal
2022
 

Full Text
PhDDissertation-AmalTawakuli.pdf
Embargo Until 25/Sep/2026 - Publisher postprint (13.43 MB)
Keywords :
Data Quality; Data Preprocessing; Edge-Cloud Collaboration; Data Normalization; Data Cleaning; Feature Engineering; Sensor Fusion; Data Debiasing; Edge Computing
Abstract :
[en] Substantial volumes of data are generated at the edge as a result of an exponential increase in the number of Internet of Things (IoT) applications. IoT data are generated at edge components and, in most cases, transmitted to central or cloud infrastructures via the network. Distributing data preprocessing to the edge, closer to the data sources, addresses issues found in the data early in the pipeline. This distribution prevents error propagation, removes redundancies, minimizes privacy leakage and optimally summarizes the information contained in the data prior to transmission. It also avoids wasting valuable yet limited edge resources on transmitting data that may contain anomalies and redundancies. New legal requirements such as the GDPR, together with ethical responsibilities, make data preprocessing that addresses these emerging topics urgent, especially at the edge before the data leave the premises of their owners. This PhD dissertation is divided into two parts that focus on two main directions within data preprocessing. The first part focuses on structuring and normalizing the data preprocessing design phase for AI applications. It involved a comprehensive survey of data preprocessing techniques coupled with an empirical analysis. From the survey, we introduced a holistic and normalized definition and scope of data preprocessing. We also identified means of generalizing data preprocessing by abstracting preprocessing techniques into categories and sub-categories. Our survey and empirical analysis highlighted dependencies and relationships between the different categories and sub-categories, which determine the order of execution within preprocessing pipelines.
The identified categories, sub-categories and their dependencies were assembled into a novel data preprocessing design tool: a template from which application- and dataset-specific preprocessing plans and pipelines are derived. The design tool is agnostic to datasets and applications and is a crucial step towards normalizing, regulating and structuring the design of data preprocessing pipelines. It helps practitioners and researchers apply a modern take on data preprocessing that enhances the reproducibility of preprocessed datasets and addresses a broader spectrum of issues in the data. The second part of the dissertation focuses on leveraging edge computing within an IoT context to distribute data preprocessing to the edge. We empirically evaluated the feasibility of distributing data preprocessing techniques from different categories and assessed the impact of the distribution, including on the consumption of resources such as time, storage, bandwidth and energy. To perform the distribution, we proposed a collaborative edge-cloud framework dedicated to data preprocessing, with two main mechanisms that achieve synchronization and coordination. The synchronization mechanism is an Over-The-Air (OTA) updating mechanism that remotely pushes updated preprocessing plans to the different edge components in response to changes in user requirements or the evolution of data characteristics. The coordination mechanism is a resilient and progressive execution mechanism that represents data preprocessing plans as Directed Acyclic Graphs (DAGs). Distributed preprocessing plans are shared between cloud and edge components and are progressively executed while adhering to the topological order dictated by the DAG representation.
To empirically test our proposed solutions, we developed DeltaWing, a prototype of our edge-cloud collaborative data preprocessing framework that consists of three stages: one central stage and two edge stages. A use case was also designed based on a dataset obtained from the Honda Research Institute US. Using DeltaWing and the use case, we simulated an automotive IoT application to evaluate our proposed solutions. Our empirical results highlight the effectiveness and positive impact of our framework in reducing the consumption of valuable resources at the edge (e.g., ≈57% reduction in bandwidth usage) while retaining information (prediction accuracy) and maintaining operational integrity. The two parts of the dissertation are interconnected yet can exist independently; combined, their contributions constitute a generic toolset for the optimization of the data preprocessing phase.
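The DAG-based coordination described in the abstract can be illustrated with a minimal sketch (the step names and example plan below are hypothetical, not taken from the dissertation): preprocessing steps become graph nodes, dependencies become edges, and a valid execution order is any topological order of the DAG.

```python
from graphlib import TopologicalSorter  # Python 3.9+ standard library

# Hypothetical preprocessing plan: each step maps to the set of steps
# it depends on (its predecessors in the DAG).
plan = {
    "cleaning": set(),                 # address anomalies first
    "normalization": {"cleaning"},     # scale cleaned values
    "sensor_fusion": {"cleaning"},     # merge cleaned sensor streams
    "feature_engineering": {"normalization", "sensor_fusion"},
    "reduction": {"feature_engineering"},  # summarize before transmission
}

def execution_order(plan: dict[str, set[str]]) -> list[str]:
    """Return one valid execution order that respects all dependencies."""
    return list(TopologicalSorter(plan).static_order())

order = execution_order(plan)
print(order)  # "cleaning" is always first and "reduction" always last
```

In a distributed setting, cloud and edge components could each execute the portion of such an order assigned to them, resuming progressively from the last completed step.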
Disciplines :
Computer science
Author, co-author :
Tawakuli, Amal ;  University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM)
Language :
English
Title :
TRANSFORMING DATA PREPROCESSING: A HOLISTIC, NORMALIZED AND DISTRIBUTED APPROACH
Defense date :
26 September 2022
Number of pages :
189
Institution :
Unilu - University of Luxembourg, Esch-sur-Alzette, Luxembourg
Degree :
DOCTEUR DE L’UNIVERSITÉ DU LUXEMBOURG EN INFORMATIQUE
Promotor :
President :
Jury member :
Meinel, Christoph
Gulisano, Vincenzo
Scherer, Thomas
Kaiser, Daniel
Available on ORBilu :
since 27 September 2022
