Reference : Transforming Time Series for Efficient and Accurate Classification
Dissertations and theses : Doctoral thesis
Engineering, computing & technology : Computer science
Computational Sciences
http://hdl.handle.net/10993/34046
Transforming Time Series for Efficient and Accurate Classification
English
Li, Daoyuan mailto [University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > >]
11-Jan-2018
University of Luxembourg, ​Luxembourg, ​​Luxembourg
Docteur en Informatique
156 + 22
Le Traon, Yves mailto
Klein, Jacques mailto
Bissyande, Tegawendé François D Assise mailto
Lin, Jessica
Geist, Matthieu
[en] Time series classification ; Time series transformatioin ; Multiscale Visibility Graph
[en] Time series data refer to sequences of data that are ordered either temporally, spatially or in another defined order. They can be frequently found in a variety of domains, including financial data analysis, medical and health monitoring and industrial automation applications. Due to their abundance and wide application scenarios, there has been an increasing need for efficient machine learning algorithms to extract information and build knowledge from these data. One of the major tasks in time series mining is time series classification (TSC), which consists of applying a learning algorithm on labeled data to train a model that will then be used to predict the classes of samples from an unlabeled data set. Due to the sequential characteristic of time series data, state-of-the-art classification algorithms (such as SVM and Random Forest) that performs well for generic data are usually not suitable for TSC. In order to improve the performance of TSC tasks, this dissertation proposes different methods to transform time series data for a better feature extraction process as well as novel algorithms to achieve better classification performance in terms of computation efficiency and classification accuracy.

In the first part of this dissertation, we conduct a large scale empirical study that takes advantage of discrete wavelet transform (DWT) for time series dimensionality reduction. We first transform real-valued time series data using different families of DWT. Then we apply dynamic time warping (DTW)-based 1NN classification on 39 datasets and find out that existing DWT-based lossy compression approaches can help to overcome the challenges of storage and computation time. Furthermore, we provide assurances to practitioners by empirically showing, with various datasets and with several DWT approaches, that TSC algorithms yield similar accuracy on both compressed (i.e.,
approximated) and raw time series data. We also show that, in some datasets, wavelets may actually help in reducing noisy variations which deteriorate the performance of TSC tasks. In a few cases, we note that the residual details/noises from compression are more useful for recognizing data patterns.

In the second part, we propose a language model-based approach for TSC named Domain Series Corpus (DSCo), in order to take advantage of mature techniques from both time series mining and Natural Language Processing (NLP) communities.
After transforming real-valued time series into texts using Symbolic Aggregate approXimation (SAX), we build per-class language models (unigrams and bigrams) from these symbolized text corpora. To classify unlabeled samples, we compute the fitness of each symbolized sample against all per-class models and choose the class represented by the model with the best fitness score. Through extensive experiments on an open dataset archive, we demonstrate that DSCo performs similarly to approaches working with original uncompressed numeric data. We further propose DSCo-NG to improve the computation efficiency and classification accuracy of DSCo. In contrast to DSCo where we try to find the best way to recursively segment time series, DSCo-NG breaks time series into smaller segments of the same size, this simplification also leads to simplified language model inference in the training phase and slightly higher classification accuracy.

The third part of this dissertation presents a multiscale visibility graph representation for time series as well as feature extraction methods for TSC, so that both global and local features are fully extracted from time series data. Unlike traditional TSC approaches that seek to find global similarities in time series databases (e.g., 1NN-DTW) or methods specializing in locating local patterns/subsequences (e.g., shapelets), we extract solely statistical features from graphs that are generated from time series. Specifically, we augment time series by means of their multiscale approximations, which are further transformed into a set of visibility graphs. After extracting probability distributions of small motifs, density, assortativity, etc., these features are used for building highly accurate classification models using generic classifiers (e.g., Support Vector Machine and eXtreme Gradient Boosting). Based on extensive experiments on a large number of open datasets and comparison with five state-of-the-art TSC algorithms, our approach is shown to be both accurate and efficient: it is more accurate than Learning Shapelets and at the same time faster than Fast Shapelets.

Finally, we list a few industrial applications that relevant to our research work, including Non-Intrusive Load Monitoring as well as anomaly detection and visualization by means for hierarchical clustering for time series data.

In summary, this dissertation explores different possibilities to improve the efficiency and accuracy of TSC algorithms. To that end, we employ a range of techniques including wavelet transforms, symbolic approximations, language models and graph mining algorithms. We experiment and evaluate our approaches using publicly available time series datasets. Comparison with the state-of-the-art shows that the approaches developed in this dissertation perform well, and contribute to advance the field of TSC.
Researchers ; Professionals ; Students ; General public ; Others
http://hdl.handle.net/10993/34046
The outcome of this dissertation comes from an industrial partnership project with Paul Wurth S.A..

File(s) associated to this reference

Fulltext file(s):

FileCommentaryVersionSizeAccess
Open access
thesis_final.pdfAuthor postprint3.69 MBView/Open

Bookmark and Share SFX Query

All documents in ORBilu are protected by a user license.