Li, Daoyuan. In Journal of Software: Evolution and Process (2019)
One single code change can significantly influence a wide range of software systems and their users. For example, 1) adding a new feature can spread defects across several modules, while 2) changing an API method can improve the performance of all client programs. Developers often do not know at commit time whether their own or others’ changes are influential. Rather, a change turns out to be influential only later, after it has affected many aspects of a system. This paper investigates influential software changes and proposes an approach to identify them early, i.e., immediately when they are applied. We first conduct a post-mortem analysis to discover existing influential changes by using intuitions such as isolated changes and changes referred to by other changes in 10 open source projects. Then we re-categorize all identified changes through an open-card sorting process. Subsequently, we conduct a survey with 89 developers to confirm our influential change categories. Finally, from our ground truth we extract features, including metrics such as the complexity of changes, terms in commit logs and file centrality in co-change graphs, to build machine learning classifiers. The experiment results show that, with random samples, our prediction model achieves an overall 86.8% precision, 74% recall and 80.4% F-measure.

Li, Daoyuan. In 21st International Conference on Extending Database Technology (2018, March)
This paper presents a multiscale visibility graph representation for time series as well as feature extraction methods for time series classification (TSC). Unlike traditional TSC approaches that seek to find global similarities in time series databases (e.g., Nearest Neighbor with Dynamic Time Warping distance) or methods specializing in locating local patterns/subsequences (e.g., shapelets), we extract solely statistical features from graphs that are generated from time series. Specifically, we augment time series by means of their multiscale approximations, which are further transformed into a set of visibility graphs. Probability distributions of small motifs, density, assortativity, etc. are then extracted from these graphs and used for building highly accurate classification models using generic classifiers (e.g., Support Vector Machine and eXtreme Gradient Boosting). Thanks to the way we transform time series into graphs and extract features from them, we are able to capture both global and local features from time series. Based on extensive experiments on a large number of open datasets and comparison with five state-of-the-art TSC algorithms, our approach is shown to be both accurate and efficient: it is more accurate than Learning Shapelets and at the same time faster than Fast Shapelets.

Li, Daoyuan. Doctoral thesis (2018)
Time series data refer to sequences of data that are ordered either temporally, spatially or in another defined order. They can be frequently found in a variety of domains, including financial data analysis, medical and health monitoring and industrial automation applications.
Due to their abundance and wide application scenarios, there has been an increasing need for efficient machine learning algorithms to extract information and build knowledge from these data. One of the major tasks in time series mining is time series classification (TSC), which consists of applying a learning algorithm on labeled data to train a model that will then be used to predict the classes of samples from an unlabeled data set. Due to the sequential characteristic of time series data, state-of-the-art classification algorithms (such as SVM and Random Forest) that perform well for generic data are usually not suitable for TSC. In order to improve the performance of TSC tasks, this dissertation proposes different methods to transform time series data for a better feature extraction process as well as novel algorithms to achieve better classification performance in terms of computation efficiency and classification accuracy.

In the first part of this dissertation, we conduct a large scale empirical study that takes advantage of discrete wavelet transform (DWT) for time series dimensionality reduction. We first transform real-valued time series data using different families of DWT. Then we apply dynamic time warping (DTW)-based 1NN classification on 39 datasets and find that existing DWT-based lossy compression approaches can help to overcome the challenges of storage and computation time. Furthermore, we provide assurances to practitioners by empirically showing, with various datasets and with several DWT approaches, that TSC algorithms yield similar accuracy on both compressed (i.e., approximated) and raw time series data. We also show that, in some datasets, wavelets may actually help in reducing noisy variations which deteriorate the performance of TSC tasks. In a few cases, we note that the residual details/noises from compression are more useful for recognizing data patterns.
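To make the idea of DWT-based lossy compression concrete, here is a minimal sketch that keeps only the approximation coefficients of a Haar transform and discards the detail coefficients. It is an illustration under stated assumptions, not the thesis's experimental setup: the Haar family is just one possible choice, and the function names (`haar_step`, `haar_compress`, `haar_reconstruct`) are hypothetical.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar DWT: returns (approximation, detail)."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_compress(signal, levels):
    """Keep only the approximation coefficients after `levels` Haar steps.

    Each level halves the number of stored values; the details are dropped,
    which is what makes the compression lossy.
    """
    approx = list(signal)
    for _ in range(levels):
        approx, _detail = haar_step(approx)
    return approx


def haar_reconstruct(approx, levels):
    """Invert haar_compress, treating every discarded detail as zero."""
    signal = list(approx)
    s = math.sqrt(2.0)
    for _ in range(levels):
        expanded = []
        for a in signal:
            # With a zero detail coefficient, both samples equal a / sqrt(2).
            expanded.extend([a / s, a / s])
        signal = expanded
    return signal


series = [2.0, 4.0, 6.0, 8.0, 9.0, 7.0, 5.0, 3.0]
compressed = haar_compress(series, levels=2)        # 8 values reduced to 2
smoothed = haar_reconstruct(compressed, levels=2)   # piecewise-constant approximation
```

With two levels, each reconstructed block of four samples equals the mean of the original four, which illustrates how discarding details can smooth out noisy variations. A classifier such as 1NN with DTW could then be run on `compressed` or `smoothed` instead of the raw series.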
In the second part, we propose a language model-based approach for TSC named Domain Series Corpus (DSCo), in order to take advantage of mature techniques from both the time series mining and Natural Language Processing (NLP) communities. After transforming real-valued time series into texts using Symbolic Aggregate approXimation (SAX), we build per-class language models (unigrams and bigrams) from these symbolized text corpora. To classify unlabeled samples, we compute the fitness of each symbolized sample against all per-class models and choose the class represented by the model with the best fitness score. Through extensive experiments on an open dataset archive, we demonstrate that DSCo performs similarly to approaches working with original uncompressed numeric data. We further propose DSCo-NG to improve the computation efficiency and classification accuracy of DSCo. In contrast to DSCo, where we try to find the best way to recursively segment time series, DSCo-NG breaks time series into smaller segments of the same size; this simplification also leads to simpler language model inference in the training phase and slightly higher classification accuracy.

The third part of this dissertation presents a multiscale visibility graph representation for time series as well as feature extraction methods for TSC, so that both global and local features are fully extracted from time series data. Unlike traditional TSC approaches that seek to find global similarities in time series databases (e.g., 1NN-DTW) or methods specializing in locating local patterns/subsequences (e.g., shapelets), we extract solely statistical features from graphs that are generated from time series. Specifically, we augment time series by means of their multiscale approximations, which are further transformed into a set of visibility graphs.
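The transformation into visibility graphs can be sketched as follows. The abstract does not say which visibility criterion or which multiscale approximation is used, so this sketch assumes the natural visibility criterion (two samples are linked when the straight line between them passes above every sample in between) and pairwise averaging as the coarsening step. All function names are hypothetical.

```python
def visibility_edges(series):
    """Edges (i, j) of the natural visibility graph of `series`.

    Samples i < j are connected when every intermediate sample k lies
    strictly below the straight line joining (i, series[i]) and (j, series[j]).
    """
    n = len(series)
    edges = []
    for i in range(n - 1):
        for j in range(i + 1, n):
            visible = all(
                series[k] < series[i] + (series[j] - series[i]) * (k - i) / (j - i)
                for k in range(i + 1, j)
            )
            if visible:
                edges.append((i, j))
    return edges


def halve(series):
    """Coarser approximation by averaging consecutive pairs (assumed scale step)."""
    return [(series[k] + series[k + 1]) / 2 for k in range(0, len(series) - 1, 2)]


def multiscale_graphs(series, scales):
    """One visibility graph per scale, from the raw series to coarser versions."""
    graphs = []
    current = list(series)
    for _ in range(scales):
        graphs.append(visibility_edges(current))
        current = halve(current)
    return graphs


def density(n_nodes, edges):
    """Fraction of possible undirected edges that are present."""
    return 2 * len(edges) / (n_nodes * (n_nodes - 1))


graphs = multiscale_graphs([1, 3, 2, 4], scales=2)
```

Graph-level statistics such as `density` are examples of features that a generic classifier could consume. This quadratic-time construction is chosen for readability rather than speed.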
Probability distributions of small motifs, density, assortativity, etc. are then extracted from these graphs and used for building highly accurate classification models using generic classifiers (e.g., Support Vector Machine and eXtreme Gradient Boosting). Based on extensive experiments on a large number of open datasets and comparison with five state-of-the-art TSC algorithms, our approach is shown to be both accurate and efficient: it is more accurate than Learning Shapelets and at the same time faster than Fast Shapelets.

Finally, we list a few industrial applications that are relevant to our research work, including Non-Intrusive Load Monitoring as well as anomaly detection and visualization by means of hierarchical clustering for time series data.

In summary, this dissertation explores different possibilities to improve the efficiency and accuracy of TSC algorithms. To that end, we employ a range of techniques including wavelet transforms, symbolic approximations, language models and graph mining algorithms. We experiment and evaluate our approaches using publicly available time series datasets. Comparison with the state-of-the-art shows that the approaches developed in this dissertation perform well and contribute to advancing the field of TSC.

Li, Li. In Journal of Computer Science and Technology (2017)
To devise efficient approaches and tools for detecting malicious packages in the Android ecosystem, researchers are increasingly required to have a deep understanding of malware. There is thus a need to provide a framework for dissecting malware and locating malicious program fragments within app code in order to build a comprehensive dataset of malicious samples.
Towards addressing this need, we propose in this work a tool-based approach called HookRanker, which provides ranked lists of potentially malicious packages based on the way malware behaviour code is triggered. With experiments on a ground truth of piggybacked apps, we are able to automatically locate the malicious packages from piggybacked Android apps with an accuracy@5 of 83.6% for packages that are triggered through method invocations and an accuracy@5 of 82.2% for packages that are triggered independently.

Li, Daoyuan. Report (2017)
Nowadays, a significant portion of the total energy consumption is attributed to the buildings sector. In order to save energy and protect the environment, buildings must consume energy more efficiently. At the same time, buildings should offer the same (if not more) comfort to their occupants. Consequently, modern buildings have been equipped with various sensors and actuators and interconnected control systems to meet occupants’ requirements. Unfortunately, so far, Building Automation Systems data have not been well exploited due to technical and cost limitations. Yet, it can be exceptionally beneficial to take full advantage of the data flowing inside buildings in order to diagnose issues, explore solutions and improve occupant-building interactions. This paper presents a plug-and-play and holistic data mining framework named PHoliData for smart buildings to collect, store, visualize and mine useful information and domain knowledge from data in smart buildings. PHoliData allows non-technical experts to easily explore and understand their buildings with minimum IT support.
An architecture of this framework has been introduced and a prototype has been implemented and tested against real-world settings. Discussions with industry experts suggest that the system is extremely helpful for understanding buildings, since it can provide hints about energy efficiency improvements. Finally, extensive experiments have demonstrated the feasibility of such a framework in practice and its advantage and potential for building operators.

Li, Li. Poster (2017, May)
The Android packaging model offers adequate opportunities for attackers to inject malicious code into popular benign apps, attempting to develop new malicious apps that can then be easily spread to a large user base. Although the literature has already presented a number of tools to detect piggybacked apps, a comprehensive investigation of the piggybacking processes is still lacking. To fill this gap, in this work, we collect a large set of benign/piggybacked app pairs that can be taken as benchmark apps for further investigation. We manually look into these benchmark pairs to understand the characteristics of piggybacking apps, and we eventually report 20 interesting findings. We expect these findings to initiate new research directions such as practical and scalable piggybacked app detection, explainable malware detection, and malicious code location.
Li, Li. In Abstract book of the 4th IEEE/ACM International Conference on Mobile Software Engineering and Systems (MobileSoft 2017) (2017, May)
To devise efficient approaches and tools for detecting malicious packages in the Android ecosystem, researchers are increasingly required to have a deep understanding of malware. There is thus a need to provide a framework for dissecting malware and locating malicious program fragments within app code in order to build a comprehensive dataset of malicious samples. Towards addressing this need, we propose in this work a tool-based approach called HookRanker, which provides ranked lists of potentially malicious packages based on the way malware behaviour code is triggered. With experiments on a ground truth set of piggybacked apps, we are able to automatically locate the malicious packages from piggybacked Android apps with an accuracy of 83.6% in verifying the top five reported items.

Li, Daoyuan. In The 32nd ACM Symposium on Applied Computing (SAC 2017) (2017, April)
As the concept of Internet of Things (IoT) develops, buildings are equipped with increasingly heterogeneous sensors to track building status as well as occupant activities. As users become more and more concerned with their privacy in buildings, explicit sensing techniques can lead to discomfort and resistance from occupants.
In this paper, we adapt a sensing-by-proxy paradigm that monitors building status and coarse occupant activities through agglomerative clustering of indoor temperature movements. Through extensive experimentation on 86 classrooms, offices and labs in a five-story school building in western Europe, we show that indoor temperature movements can be leveraged to infer latent information about indoor environments, especially about rooms' relative physical locations and the rough type of occupant activities. Our results provide evidence of a cost-effective approach to extending commercial building control systems and gaining extra relevant intelligence from such systems.

Li, Li. In IEEE Transactions on Information Forensics and Security (2017)
The Android packaging model offers ample opportunities for malware writers to piggyback malicious code in popular apps, which can then be easily spread to a large user base. Although recent research has produced approaches and tools to identify piggybacked apps, the literature lacks a comprehensive investigation into this phenomenon. We fill this gap by 1) systematically building a large set of piggybacked and benign app pairs, which we release to the community, 2) empirically studying the characteristics of malicious piggybacked apps in comparison with their benign counterparts, and 3) providing insights on piggybacking processes. Among several findings providing insights that analysis techniques should build upon to improve the overall detection and classification accuracy of piggybacked apps, we show that piggybacking operations not only concern app code but also extensively manipulate app resource files, largely contradicting common beliefs.
We also find that piggybacking is done with little sophistication, in many cases automatically, and often via library code.

Li, Daoyuan. In The 15th International Symposium on Intelligent Data Analysis (2016, October)
The abundance of time series data in various domains and their high dimensionality make it challenging to harvest useful information from them. To tackle storage and processing challenges, compression-based techniques have been proposed. Our previous work, Domain Series Corpus (DSCo), compresses time series into symbolic strings and takes advantage of language modeling techniques to extract knowledge about different classes from the training set. However, this approach was flawed in practice due to its excessive memory usage and the need for a priori knowledge about the dataset. In this paper we propose DSCo-NG, which reduces DSCo’s complexity and offers an efficient (linear time complexity and low memory footprint), accurate (performance comparable to approaches working on uncompressed data) and generic (so that it can be applied to various domains) approach for time series classification. Our confidence is backed by extensive experimental evaluation against publicly accessible datasets, which also offers insights on when DSCo-NG can be a better choice than others.
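As a rough illustration of classifying symbolized time series with per-class language models, the sketch below symbolizes each series with a SAX-style discretization and scores it against one bigram model per class. The abstract does not give DSCo-NG's segmentation or model inference, so every detail here is an assumption: the four-letter alphabet, the equal-size segments, the add-one smoothing and all function names are hypothetical choices.

```python
import math
from collections import Counter
from statistics import NormalDist, mean, pstdev


def sax(series, n_segments, alphabet="abcd"):
    """SAX-style word: z-normalise, average per equal-size segment, discretise.

    Assumes len(series) is a multiple of n_segments.
    """
    mu, sigma = mean(series), pstdev(series)
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]
    size = len(z) // n_segments
    paa = [mean(z[s * size:(s + 1) * size]) for s in range(n_segments)]
    a = len(alphabet)
    # Breakpoints cut the standard normal distribution into equiprobable regions.
    breakpoints = [NormalDist().inv_cdf(i / a) for i in range(1, a)]
    return "".join(alphabet[sum(v > b for b in breakpoints)] for v in paa)


def train_bigrams(words):
    """Count adjacent symbol pairs over all training words of one class."""
    counts = Counter()
    for w in words:
        counts.update(zip(w, w[1:]))
    return counts


def log_fitness(word, counts, alphabet_size):
    """Add-one smoothed log-probability of the word's bigrams under one class model."""
    total = sum(counts.values())
    vocab = alphabet_size ** 2
    return sum(
        math.log((counts[bg] + 1) / (total + vocab)) for bg in zip(word, word[1:])
    )


def classify(word, models, alphabet_size=4):
    """Pick the class whose bigram model gives the word the best fitness."""
    return max(models, key=lambda label: log_fitness(word, models[label], alphabet_size))


models = {
    "up": train_bigrams([sax([1, 2, 3, 4, 5, 6, 7, 8], 4)]),
    "down": train_bigrams([sax([8, 7, 6, 5, 4, 3, 2, 1], 4)]),
}
label = classify(sax([2, 4, 6, 8, 10, 12, 14, 16], 4), models)
```

Both training and scoring make a single pass over each symbolized word, which is in the spirit of the linear-time, low-memory goal stated above, although this toy model is not the published algorithm.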
Li, Daoyuan. In International Journal of Software Engineering and Knowledge Engineering (2016), 26(9&10), 1361-1377

Li, Daoyuan. In 12th International Conference on Machine Learning and Data Mining (MLDM 2016) (2016, July)
Time series data are abundant in various domains and are often characterized as large in size and high in dimensionality, leading to storage and processing challenges. Symbolic representation of time series, which transforms numeric time series data into texts, is a promising technique to address these challenges. However, these techniques are essentially lossy compression functions, and information is partially lost during transformation. To address this, we propose a novel approach named Domain Series Corpus (DSCo), which builds per-class language models from the symbolized texts. To classify unlabeled samples, we compute the fitness of each symbolized sample against all per-class models and choose the class represented by the model with the best fitness score. Our work takes advantage of mature techniques from both the time series mining and NLP communities. Through extensive experiments on an open dataset archive, we demonstrate that it performs similarly to approaches working with original uncompressed numeric data.

Li, Daoyuan. In The 28th International Conference on Software Engineering and Knowledge Engineering (SEKE 2016) (2016, July)
Time series mining has become essential for extracting knowledge from the abundant data that flows out from many application domains. To overcome storage and processing challenges in time series mining, compression techniques are being used. In this paper, we investigate the loss/gain of performance of time series classification approaches when fed with lossy-compressed data. This empirical study is essential for reassuring practitioners, but also for providing more insights on how compression techniques can even be effective in reducing noise in time series data. From a knowledge engineering perspective, we show that time series may be compressed by 90% using discrete wavelet transforms and still achieve remarkable classification accuracy, and that residual details left by popular wavelet compression techniques can sometimes even help achieve higher classification accuracy than the raw time series data, as they better capture essential local features.

Li, Li. Report (2016)

Li, Li. In The 31st ACM/SIGAPP Symposium on Applied Computing (SAC 2016) (2016, April)
Despite much effort in the community, the momentum of Android research has not yet produced complete tools to perform thorough analysis on Android apps, leaving users vulnerable to malicious apps. Because it is hard for a single tool to efficiently address all of the various challenges of Android programming which make analysis difficult, we propose to instrument the app code to reduce the analysis complexity, e.g., by transforming a hard problem into an easily resolvable one.
To this end, we introduce in this paper Apkpler, a plugin-based framework for supporting such instrumentation. We evaluate Apkpler with two plugins, demonstrating the feasibility of our approach and showing that Apkpler can indeed be leveraged to reduce the analysis complexity of Android apps.

Li, Daoyuan. In The 2016 IEEE International Conference on Industrial Technology (ICIT 2016) (2016, March)

Li, Li. In The 2015 IEEE International Conference on Software Quality, Reliability and Security (QRS 2015) (2015, August)

Li, Li. Report (2015)
We discuss the capability of a new feature set for malware detection based on potential component leaks (PCLs). PCLs are defined as sensitive data-flows that involve Android inter-component communications. We show that PCLs are common in Android apps and that malicious applications indeed manipulate significantly more PCLs than benign apps. Then, we evaluate a machine learning-based approach relying on PCLs. Experimental validation shows high performance, with 95% precision for identifying malware, demonstrating that PCLs can be used for discriminating malicious apps from benign apps. By further investigating the generalization ability of this feature set, we highlight an issue often overlooked in the Android malware detection community: qualitative aspects of training datasets have a strong impact on a malware detector’s performance. Furthermore, this impact cannot be overcome by simply increasing the quantity of training material.