Li, Daoyuan in Journal of Software: Evolution and Process (2019)
A single code change can significantly influence a wide range of software systems and their users. For example, 1) adding a new feature can spread defects across several modules, while 2) changing an API method can improve the performance of all client programs. Developers often do not know at commit time whether their own or others’ changes are influential; a change turns out to be influential only later, after it has affected many aspects of a system. This paper investigates influential software changes and proposes an approach to identify them early, i.e., immediately when they are applied. We first conduct a post-mortem analysis to discover existing influential changes by using intuitions such as isolated changes and changes referred to by other changes in 10 open source projects. Then we re-categorize all identified changes through an open card sorting process. Subsequently, we conduct a survey with 89 developers to confirm our influential change categories. Finally, from our ground truth we extract features, including metrics such as the complexity of changes, terms in commit logs and file centrality in co-change graphs, to build machine learning classifiers. The experiment results show that, on random samples, our prediction model achieves 86.8% precision, 74% recall and 80.4% F-measure overall.

Li, Li in Journal of Computer Science and Technology (2017)
To devise efficient approaches and tools for detecting malicious packages in the Android ecosystem, researchers increasingly need a deep understanding of malware. There is thus a need for a framework for dissecting malware and locating malicious program fragments within app code in order to build a comprehensive dataset of malicious samples. To address this need, we propose in this work a tool-based approach called HookRanker, which provides ranked lists of potentially malicious packages based on the way malware behaviour code is triggered. With experiments on a ground truth of piggybacked apps, we are able to automatically locate the malicious packages in piggybacked Android apps with an accuracy@5 of 83.6% for packages that are triggered through method invocations and an accuracy@5 of 82.2% for packages that are triggered independently.

Li, Li in The International Conference on Software Maintenance and Evolution (ICSME) (2017, September)
This paper presents a retrospective on an Android app collection named AndroZoo and on research conducted on top of the collection. AndroZoo is a growing collection of Android apps from various markets including the official Google Play. At the moment, over five million Android apps have been collected. Based on AndroZoo, we have explored several directions that mine Android apps to resolve various challenges. In this work, we summarize those resolved mining challenges in three research dimensions (code analysis, app evolution analysis, and malware analysis) and present in each dimension several case studies that experimentally demonstrate the usefulness of AndroZoo.

Li, Li Poster (2017, May)
The Android packaging model offers ample opportunities for attackers to inject malicious code into popular benign apps, in an attempt to develop new malicious apps that can then be easily spread to a large user base. Although the literature has already presented a number of tools to detect piggybacked apps, a comprehensive investigation of the piggybacking process is still lacking. To fill this gap, in this work, we collect a large set of benign/piggybacked app pairs that can be taken as benchmark apps for further investigation. We manually examine these benchmark pairs to understand the characteristics of piggybacked apps, and we report 20 interesting findings. We expect these findings to initiate new research directions such as practical and scalable piggybacked app detection, explainable malware detection, and malicious code localization.

Li, Li in Abstract book of the 4th IEEE/ACM International Conference on Mobile Software Engineering and Systems (MobileSoft 2017) (2017, May)
To devise efficient approaches and tools for detecting malicious packages in the Android ecosystem, researchers increasingly need a deep understanding of malware. There is thus a need for a framework for dissecting malware and locating malicious program fragments within app code in order to build a comprehensive dataset of malicious samples.
To address this need, we propose in this work a tool-based approach called HookRanker, which provides ranked lists of potentially malicious packages based on the way malware behaviour code is triggered. With experiments on a ground truth set of piggybacked apps, we are able to automatically locate the malicious packages in piggybacked Android apps with an accuracy of 83.6% when verifying the top five reported items.

Li, Li Poster (2017, May)
App repackaging is a common threat in the Android ecosystem. To face this threat, the literature now includes a large body of work proposing approaches for identifying repackaged apps. Unfortunately, although most research involves pairwise similarity comparison to distinguish repackaged apps from their “original” counterparts, no work has considered the threat to validity of not being able to discover the true original apps. We provide in this paper preliminary insights from an investigation into the Multi-Generation Repackaging Hypothesis: is the original in a repackaging process the outcome of a previous repackaging process? Leveraging the AndroZoo dataset of over 5 million Android apps, we validate this hypothesis in the wild, calling upon the community to take this threat into account in new solutions for repackaged app detection.

Li, Li in IEEE Transactions on Information Forensics and Security (2017)
The Android packaging model offers ample opportunities for malware writers to piggyback malicious code in popular apps, which can then be easily spread to a large user base. Although recent research has produced approaches and tools to identify piggybacked apps, the literature lacks a comprehensive investigation into this phenomenon. We fill this gap by 1) systematically building a large set of piggybacked and benign app pairs, which we release to the community, 2) empirically studying the characteristics of malicious piggybacked apps in comparison with their benign counterparts, and 3) providing insights into piggybacking processes. Among several findings that analysis techniques should build upon to improve the overall detection and classification accuracy of piggybacked apps, we show that piggybacking operations not only concern app code but also extensively manipulate app resource files, largely contradicting common beliefs. We also find that piggybacking is done with little sophistication, in many cases automatically, and often via library code.

; ; Li, Li in Information and Software Technology (2017)
Context: State-of-the-art works on automated detection of Android malware have leveraged app descriptions to spot anomalies w.r.t. the functionality implemented, or have used data flow information as a feature to discriminate malicious from benign apps. Although these works have yielded promising performance, we hypothesize that this performance can be improved by a better understanding of malicious behavior.
Objective: To characterize malicious apps, we take into account both information on app descriptions, which is indicative of apps’ topics, and information on sensitive data flow, which can help discriminate malware from benign apps. Method: In this paper, we propose a topic-specific approach to malware comprehension based on app descriptions and data-flow information. First, we use an advanced topic model, adaptive LDA with GA, to cluster apps according to their descriptions. Then, we use the information gain ratio of sensitive data flow information to build so-called “topic-specific data flow signatures”. Results: We conduct an empirical study on 3691 benign and 1612 malicious apps. We group them into 118 topics and generate topic-specific data flow signatures. We verify the effectiveness of the topic-specific data flow signatures by comparing them with the overall data flow signature. In addition, we perform a deeper analysis of 25 representative topic-specific signatures and derive several implications. Conclusion: Topic-specific data flow signatures are efficient in highlighting malicious behavior, and thus can help in characterizing malware.

Li, Li in Information and Software Technology (2017)
Context: Static analysis exploits techniques that parse program source code or bytecode, often traversing program paths to check some program properties. Static analysis approaches have been proposed for different tasks, including assessing the security of Android apps, detecting app clones, automating test case generation, and uncovering non-functional issues related to performance or energy.
The literature has thus proposed a large body of works, each of which attempts to tackle one or more of the several challenges that program analysers face when dealing with Android apps. Objective: We aim to provide a clear view of the state-of-the-art works that statically analyse Android apps, from which we highlight the trends of static analysis approaches, pinpoint where the focus has been put, and enumerate the key aspects where future research is still needed. Method: We have performed a systematic literature review (SLR) which involves studying 124 research papers published in software engineering, programming languages and security venues in the last 5 years (January 2011 - December 2015). This review is performed mainly along five dimensions: problems targeted by the approach, fundamental techniques used by authors, static analysis sensitivities considered, Android characteristics taken into account and the scale of evaluation performed. Results: Our in-depth examination has led to several key findings: 1) Static analysis is largely performed to uncover security and privacy issues; 2) The Soot framework and the Jimple intermediate representation are the most adopted basic support tool and format, respectively; 3) Taint analysis remains the most applied technique in research approaches; 4) Most approaches support several analysis sensitivities, but very few approaches consider path-sensitivity; 5) There is no single work that has been proposed to tackle all challenges of static analysis that are related to Android programming; and 6) Only a small portion of state-of-the-art works have made their artefacts publicly available. Conclusion: The research community is still facing a number of challenges in building approaches that are simultaneously aware of implicit flows, dynamic code loading features, reflective calls, native code and multi-threading, in order to implement sound and highly precise static analyzers.

Li, Li in Abstract book of the 16th IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom) (2017)
App updates and repackaging are recurrent in the Android ecosystem, filling markets with similar apps that must be identified and analysed to accelerate user adoption, improve development efforts, and prevent malware from spreading. Despite the existence of several approaches to improve the scalability of detecting repackaged/cloned apps, researchers and practitioners are eventually faced with the need for a comprehensive pairwise comparison to understand and validate the similarities among apps. This paper describes the design of SimiDroid, a framework for multi-level comparison of Android apps. SimiDroid is built to support the understanding of similarities/changes among app versions and among repackaged apps. In particular, we demonstrate the need for and usefulness of such a framework through case studies implementing different analysis scenarios that reveal various insights into how repackaged apps are built. We further show that the similarity comparison plugins implemented in SimiDroid yield more accurate results than the state of the art.

; Li, Li in The 2017 ACM on Asia Conference on Computer and Communications Security (AsiaCCS 2017) (2017)
Reflection is a language feature that makes it possible to analyze and transform the behavior of classes at runtime.
Reflection is used for software debugging and testing. Malware authors can leverage reflection to subvert malware detection by static analyzers. Reflection can initialize a class, invoke any method of a class, or access any field of a class. However, instead of using the usual programming language syntax, reflection passes classes, methods, etc. as parameters to reflective APIs. As a consequence, these parameters can be constructed dynamically or can be encrypted by malware, and they cannot be detected by state-of-the-art static tools. We propose EspyDroid, a system that combines dynamic analysis with code instrumentation for more precise and automated detection of malware employing reflection. We evaluate EspyDroid on 28 benchmark apps employing major reflection categories. Our technique shows improved results over FlowDroid by detecting additional, previously undetected flows. These flows have the potential to leak sensitive and private user information through various sinks.

Li, Li Doctoral thesis (2016)
Within a few years, Android has been established as a leading platform in the mobile market with over one billion monthly active Android users. To serve these users, the official market, Google Play, hosts around 2 million apps which have penetrated into a variety of user activities and have played an essential role in their daily life. However, this penetration has also opened doors for malicious apps, presenting big threats that can lead to severe damage. To alleviate the security threats posed by Android apps, the literature has proposed a large body of works that present static and dynamic approaches for identifying and managing security issues in the mobile ecosystem.
Static analysis in particular, which does not require actually executing the code of Android apps, has been used extensively for market-scale analysis. In order to better understand how static analysis is applied, we conduct a systematic literature review (SLR) of related research for Android. We studied influential research papers published in the last five years (from 2011 to 2015). Our in-depth examination of those papers reveals, among other findings, that static analysis is largely performed to uncover security and privacy issues. The SLR also highlights that no single work has been proposed to tackle all the challenges for static analysis of Android apps. Existing approaches indeed fail to yield sound results in various analysis cases, given the different specificities of Android programming. Our objective is thus to reduce the analysis complexity of Android apps in a way that existing approaches can also succeed on their failed cases. To this end, we propose to instrument the app code to transform a given hard problem into an easily resolvable one (e.g., reducing an inter-app analysis problem to an intra-app analysis problem). As a result, our code instrumentation boosts existing static analyzers in a non-invasive manner (i.e., no need to modify those analyzers). In this dissertation, we apply code instrumentation to solve three well-known challenges of static analysis of Android apps, allowing existing static security analyses to 1) be inter-component communication (ICC) aware; 2) be reflection aware; and 3) cut out common libraries. ICC is a challenge for static analysis. Indeed, the ICC mechanism is driven at the framework level rather than the app level, leaving it invisible to app-targeted static analyzers. As a consequence, static analyzers can only build an incomplete control-flow graph (CFG), which prevents a sound analysis.
To support ICC-aware analysis, we devise an approach called IccTA, which instruments app code by adding glue code that directly connects components using the traditional Java class access mechanism (e.g., explicit new instantiation of target components). Reflection is a challenge for static analysis as well because it also confuses the analysis context. To support reflection-aware analysis, we provide DroidRA, a tool-based approach, which instruments Android apps to explicitly replace reflective calls with their corresponding traditional Java calls. The mapping from reflective calls to traditional Java calls is inferred through a solver, where the resolution of reflective calls is reduced to a composite constant propagation problem. Libraries are pervasively used in Android apps. On the one hand, their presence increases the time/memory consumption of static analysis. On the other hand, they may lead to false positives and false negatives for static approaches (e.g., clone detection and machine learning-based malware detection). To mitigate this, we propose to instrument Android apps to cut out a set of automatically identified common libraries from the app code, so as to improve static analyzers’ performance in terms of time/memory as well as accuracy. To sum up, in this dissertation, we leverage code instrumentation to boost existing static analyzers, allowing them to yield more sound results and to perform quicker analyses. Thanks to the aforementioned approaches, we are now able to automatically identify malicious apps. However, it is still unknown how malicious payloads are introduced into those malicious apps. As a perspective for our future research, we conduct a thorough dissection of piggybacked apps (whose malicious payloads are easily identifiable) at the end of this dissertation, in an attempt to understand how malicious apps are actually built.
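
The reflection-aware instrumentation described in the abstract above can be illustrated with a small sketch. It is written in Python rather than Java and is only an analogy, not DroidRA's implementation: the class `Greeter`, the two constant strings and both helper functions are hypothetical names introduced for illustration. The idea it shows is that when the class and method names reaching a reflective call are constants, the call can be rewritten as an ordinary direct call that a reflection-unaware analyzer can follow.

```python
"""Toy analogy of replacing a reflective call with a direct call.

Assumption: everything here (Greeter, the constants, the helpers) is
hypothetical and only illustrates the idea; it is not DroidRA's code.
"""


class Greeter:
    def greet(self, name):
        return "hello " + name


# Constant strings, as a constant-propagation analysis might recover them.
CLASS_NAME = "Greeter"
METHOD_NAME = "greet"


def call_reflectively(name):
    # Reflective dispatch: the target is known only through strings.
    cls = globals()[CLASS_NAME]
    method = getattr(cls(), METHOD_NAME)
    return method(name)


def call_directly(name):
    # What an instrumented version would contain: an explicit call
    # that a reflection-unaware analyzer can follow.
    return Greeter().greet(name)


if __name__ == "__main__":
    # Both paths reach the same method, so the results agree.
    print(call_reflectively("world") == call_directly("world"))
```

In this sketch the direct call is written by hand; per the abstract, the actual approach infers the mapping through a solver and applies it by instrumenting the app.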

Li, Li in The 32nd International Conference on Software Maintenance and Evolution (ICSME) (2016, October)
As Android becomes a de facto choice of development platform for mobile apps, developers extensively leverage its accompanying Software Development Kit to quickly build their apps. This SDK comes with a set of APIs which developers may find limited in comparison to what system apps can do or what framework developers are preparing to harness capabilities of new generation devices. Thus, developers may attempt to explore in advance the normally “inaccessible” APIs for building unique API-based functionality in their app. The Android programming model is unique in its kind. Inaccessible APIs, which are nevertheless used by developers, constitute yet another specificity of Android development, and are worth investigating to understand what they are, how they evolve over time, and who uses them. To that end, in this work, we empirically investigate 17 important releases of the Android framework source code base, and we find that inaccessible APIs are commonly implemented in the Android framework and that they are neither forward nor backward compatible. Moreover, a small set of inaccessible APIs can eventually become publicly accessible, while most of them are removed during the evolution, resulting in risks for apps that have leveraged inaccessible APIs. Finally, we show that inaccessible APIs are indeed accessed by third-party apps, and the official Google Play store has tolerated the proliferation of apps leveraging inaccessible API methods.

Li, Li in The 31st IEEE/ACM International Conference on Automated Software Engineering (ASE) (2016, September)
We demonstrate the benefits of DroidRA, a tool for taming reflection in Android apps. DroidRA first statically extracts reflection-related object values from a given Android app. Then, it leverages the extracted values to boost the app in a way that reflective calls are no longer a challenge for existing static analyzers. This is achieved through a bytecode instrumentation approach, where reflective calls are supplemented with explicit traditional Java method calls that can be followed by state-of-the-art analyzers which do not handle reflection. Instrumented apps can thus be completely analyzed by existing static analyzers, which no longer need to be modified to support reflection-aware analysis. The video demo of DroidRA can be found at https://youtu.be/-HW0V68aAWc

Li, Daoyuan in 12th International Conference on Machine Learning and Data Mining (MLDM 2016) (2016, July)
Time series data are abundant in various domains and are often characterized as large in size and high in dimensionality, leading to storage and processing challenges. Symbolic representation of time series, which transforms numeric time series data into texts, is a promising technique to address these challenges. However, these techniques are essentially lossy compression functions and information is partially lost during transformation.
To that end, we propose a novel approach named Domain Series Corpus (DSCo), which builds per-class language models from the symbolized texts. To classify unlabeled samples, we compute the fitness of each symbolized sample against all per-class models and choose the class represented by the model with the best fitness score. Our work innovatively takes advantage of mature techniques from both the time series mining and NLP communities. Through extensive experiments on an open dataset archive, we demonstrate that it performs similarly to approaches working with original uncompressed numeric data.

Li, Li in The 2016 International Symposium on Software Testing and Analysis (2016, July)
Android developers heavily use reflection in their apps for legitimate reasons, but also significantly for hiding malicious actions. Unfortunately, current state-of-the-art static analysis tools for Android are challenged by the presence of reflective calls, which they usually ignore. Thus, the results of their security analysis, e.g., for private data leaks, are inconsistent given the measures taken by malware writers to elude static detection. We propose the DroidRA instrumentation-based approach to address this issue in a non-invasive way. With DroidRA, we reduce the resolution of reflective calls to a composite constant propagation problem. We leverage the COAL solver to infer the values of reflection targets in an app, and we eventually instrument this app to include the corresponding traditional Java call for each reflective call. Our approach makes it possible to boost an app so that it is immediately analyzable, including by static analyzers that are not reflection-aware.
We evaluate DroidRA on benchmark apps as well as on real-world apps, and demonstrate that it can allow state-of-the-art tools to provide more sound and complete analysis results.

Li, Li in The Doctoral Symposium of 38th International Conference on Software Engineering (ICSE-DS 2016) (2016, May)
Static analysis has been applied to dissect Android apps for many years. The main advantages of static analysis are its efficiency and its coverage of the entire code. However, the community has not yet produced complete tools to perform in-depth static analysis, putting users at risk from malicious apps. Because of the diverse challenges caused by Android apps, it is hard for a single tool to efficiently address all of them. Thus, in this work, we propose to boost static analysis of Android apps through code instrumentation, in which the knotty code can be reduced or simplified into equivalent but analyzable code. Consequently, existing static analyzers, without any modification, can be leveraged to perform extensive analyses that they could not perform originally. Previously, we have successfully applied instrumentation to two challenges of static analysis of Android apps: Inter-Component Communication (ICC) and Reflection. However, these two case studies are implemented separately and the implementation is not reusable, so functionality that could be shared between them is reinvented and resources are wasted. To this end, in this work, we aim at providing a generic and non-invasive approach for existing static analyzers, enabling them to perform broader analysis.

Li, Li Report (2016)
Context: Static analysis approaches have been proposed to assess the security of Android apps, by searching for known vulnerabilities or actual malicious code. The literature has thus proposed a large body of works, each of which attempts to tackle one or more of the several challenges that program analyzers face when dealing with Android apps. Objective: We aim to provide a clear view of the state-of-the-art works that statically analyze Android apps, from which we highlight the trends of static analysis approaches, pinpoint where the focus has been put and enumerate the key aspects where future research is still needed. Method: We have performed a systematic literature review which involves studying around 90 research papers published in software engineering, programming languages and security venues. This review is performed mainly along five dimensions: problems targeted by the approach, fundamental techniques used by authors, static analysis sensitivities considered, Android characteristics taken into account and the scale of evaluation performed.
Results: Our in-depth examination has led to several key findings: 1) Static analysis is largely performed to uncover security and privacy issues; 2) The Soot framework and the Jimple intermediate representation are the most adopted basic support tool and format, respectively; 3) Taint analysis remains the most applied technique in research approaches; 4) Most approaches support several analysis sensitivities, but very few approaches consider path-sensitivity; 5) There is no single work that has been proposed to tackle all challenges of static analysis that are related to Android programming; and 6) Only a small portion of state-of-the-art works have made their artifacts publicly available. Conclusion: The research community is still facing a number of challenges in building approaches that are simultaneously aware of implicit flows, dynamic code loading features, reflective calls, native code and multi-threading, in order to implement sound and highly precise static analyzers.

Li, Li Report (2016)

Li, Li in The 31st ACM/SIGAPP Symposium on Applied Computing (SAC 2016) (2016, April)
Despite much effort in the community, the momentum of Android research has not yet produced complete tools to perform thorough analysis of Android apps, leaving users vulnerable to malicious apps. Because it is hard for a single tool to efficiently address all of the various challenges of Android programming that make analysis difficult, we propose to instrument the app code to reduce the analysis complexity, e.g., by transforming a hard problem into an easily resolvable one.
To this end, we introduce in this paper Apkpler, a plugin-based framework for supporting such instrumentation. We evaluate Apkpler with two plugins, demonstrating the feasibility of our approach and showing that Apkpler can indeed be leveraged to reduce the analysis complexity of Android apps.