Doctoral thesis (Dissertations and theses)
TRANSFORMING DATA PREPROCESSING: A HOLISTIC, NORMALIZED AND DISTRIBUTED APPROACH
Tawakuli, Amal
2022
 

Full Text
PhDDissertation-AmalTawakuli.pdf
Embargo Until 25/Sep/2026 - Publisher postprint (13.43 MB)
Keywords :
Data Quality; Data Preprocessing; Edge-Cloud Collaboration; Data Normalization; Data Cleaning; Feature Engineering; Sensor Fusion; Data Debiasing; Edge Computing
Abstract :
[en] Substantial volumes of data are generated at the edge as a result of an exponential increase in the number of Internet of Things (IoT) applications. IoT data are generated at edge components and, in most cases, transmitted to central or cloud infrastructures via the network. Distributing data preprocessing to the edge, closer to the data sources, addresses issues found in the data early in the pipeline. This distribution prevents error propagation, removes redundancies, minimizes privacy leakage and optimally summarizes the information contained in the data prior to transmission. It also avoids wasting valuable yet limited edge resources on transmitting data that may contain anomalies and redundancies. New legal requirements such as the GDPR, together with ethical responsibilities, make data preprocessing that addresses these emerging topics urgent, especially at the edge before the data leave the premises of their owners. This PhD dissertation is divided into two parts that focus on two main directions within data preprocessing. The first part focuses on structuring and normalizing the data preprocessing design phase for AI applications. It involved a comprehensive survey of data preprocessing techniques coupled with an empirical analysis. From the survey, we introduced a holistic and normalized definition and scope of data preprocessing. We also identified means of generalizing data preprocessing by abstracting preprocessing techniques into categories and sub-categories. Our survey and empirical analysis highlighted dependencies and relationships between the different categories and sub-categories, which determine the order of execution within preprocessing pipelines.
The identified categories, sub-categories and their dependencies were assembled into a novel data preprocessing design tool: a template from which application- and dataset-specific preprocessing plans and pipelines are derived. The design tool is agnostic to datasets and applications and is a crucial step towards normalizing, regulating and structuring the design of data preprocessing pipelines. It helps practitioners and researchers apply a modern take on data preprocessing that enhances the reproducibility of preprocessed datasets and addresses a broader spectrum of issues in the data. The second part of the dissertation focuses on leveraging edge computing within an IoT context to distribute data preprocessing to the edge. We empirically evaluated the feasibility of distributing data preprocessing techniques from different categories and assessed the impact of the distribution, including on the consumption of resources such as time, storage, bandwidth and energy. To perform the distribution, we proposed a collaborative edge-cloud framework dedicated to data preprocessing, with two main mechanisms that achieve synchronization and coordination. The synchronization mechanism is an Over-The-Air (OTA) updating mechanism that remotely pushes updated preprocessing plans to the different edge components in response to changes in user requirements or the evolution of data characteristics. The coordination mechanism is a resilient and progressive execution mechanism that represents data preprocessing plans as Directed Acyclic Graphs (DAGs). Distributed preprocessing plans are shared between cloud and edge components and are progressively executed while adhering to the topological order dictated by the DAG representation.
To empirically test our proposed solutions, we developed DeltaWing, a prototype of our edge-cloud collaborative data preprocessing framework that consists of three stages: one central stage and two edge stages. A use case was also designed based on a dataset obtained from the Honda Research Institute US. Using DeltaWing and the use case, we simulated an automotive IoT application to evaluate our proposed solutions. Our empirical results highlight the effectiveness and positive impact of our framework in reducing the consumption of valuable resources at the edge (e.g., ≈57% reduction in bandwidth usage) while retaining information (prediction accuracy) and maintaining operational integrity. The two parts of the dissertation are interconnected yet can exist independently; combined, their contributions constitute a generic toolset for the optimization of the data preprocessing phase.
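The DAG-based coordination described in the abstract can be illustrated with a minimal sketch (the step names and example plan below are hypothetical, not taken from the dissertation): preprocessing steps become graph nodes, dependencies become edges, and a valid execution order is any topological order of the DAG.

```python
from graphlib import TopologicalSorter  # Python 3.9+ standard library

# Hypothetical preprocessing plan: each step maps to the set of steps
# it depends on (its predecessors in the DAG).
plan = {
    "cleaning": set(),                 # address anomalies first
    "normalization": {"cleaning"},     # scale cleaned values
    "sensor_fusion": {"cleaning"},     # merge cleaned sensor streams
    "feature_engineering": {"normalization", "sensor_fusion"},
    "reduction": {"feature_engineering"},  # summarize before transmission
}

def execution_order(plan: dict[str, set[str]]) -> list[str]:
    """Return one valid execution order that respects all dependencies."""
    return list(TopologicalSorter(plan).static_order())

order = execution_order(plan)
print(order)  # "cleaning" is always first and "reduction" always last
```

In a distributed setting, cloud and edge components could each execute the portion of such an order assigned to them, resuming progressively from the last completed step.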
Disciplines :
Computer science
Author, co-author :
Tawakuli, Amal ;  University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM)
Language :
English
Title :
TRANSFORMING DATA PREPROCESSING: A HOLISTIC, NORMALIZED AND DISTRIBUTED APPROACH
Defense date :
26 September 2022
Number of pages :
189
Institution :
Unilu - University of Luxembourg, Esch-sur-Alzette, Luxembourg
Degree :
DOCTEUR DE L’UNIVERSITÉ DU LUXEMBOURG EN INFORMATIQUE
Promotor :
President :
Jury member :
Meinel, Christoph
Gulisano, Vincenzo
Scherer, Thomas
Kaiser, Daniel
Available on ORBilu :
since 27 September 2022
