Titcheu Chekam, Thierry, in Empirical Software Engineering (in press).

; ; et al., in Empirical Software Engineering (in press).
Regression testing is an essential activity to assure that software code changes do not adversely affect existing functionalities. With the wide adoption of Continuous Integration (CI) in software projects, which increases the frequency of running software builds, running all tests can be time-consuming and resource-intensive. To alleviate that problem, Test case Selection and Prioritization (TSP) techniques have been proposed to improve regression testing by selecting and prioritizing test cases in order to provide early feedback to developers. In recent years, researchers have relied on Machine Learning (ML) techniques to achieve effective TSP (ML-based TSP). Such techniques help combine information about test cases, from partial and imperfect sources, into accurate prediction models. This work conducts a systematic literature review focused on ML-based TSP techniques, aiming to perform an in-depth analysis of the state of the art, thus gaining insights regarding future avenues of research. To that end, we analyze 29 primary studies published from 2006 to 2020, which have been identified through a systematic and documented process. This paper addresses five research questions covering variations in ML-based TSP techniques and feature sets for training and testing ML models, alternative metrics used for evaluating the techniques, the performance of techniques, and the reproducibility of the published studies. We summarize the results related to our research questions in a high-level summary that can be used as a taxonomy for classifying future TSP studies.

Daoudi, Nadia, in Empirical Software Engineering (2022), 28.
Research on Android malware detection based on machine learning has been prolific in recent years. In this paper, we show, through a large-scale evaluation of four state-of-the-art approaches, that their achieved performance fluctuates when applied to different datasets. Combining existing approaches appears as an appealing method to stabilise performance. We therefore proceed to empirically investigate the effect of such combinations on the overall detection performance. In our study, we evaluated 22 methods to combine feature sets or predictions from the state-of-the-art approaches. Our results showed that no method has significantly enhanced the detection performance reported by the state-of-the-art malware detectors. Nevertheless, the performance achieved is on par with the best individual classifiers for all settings. Overall, we conduct extensive experiments on the opportunity to combine state-of-the-art detectors. Our main conclusion is that combining state-of-the-art malware detectors leads to a stabilisation of the detection performance, and a research agenda on how they should be combined effectively is required to boost malware detection. All artefacts of our large-scale study (i.e., the dataset of ∼0.5 million APKs and all extracted features) are made available for replicability.
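The study above evaluates 22 ways of combining feature sets or predictions from existing detectors. As a rough illustration of the prediction-level family of combinations only, the sketch below applies a simple majority vote over the binary verdicts of several detectors; the app names and verdicts are invented for the example and none of the paper's actual combination methods is reproduced.

```python
# Minimal sketch: combine the binary verdicts of several malware detectors by
# majority vote. Detector outputs and app names are illustrative only.
from collections import Counter

def majority_vote(verdicts):
    """Return the label predicted by most detectors; ties resolve to 'malware'
    to stay on the conservative side."""
    counts = Counter(verdicts)
    return "malware" if counts["malware"] >= counts["goodware"] else "goodware"

# Hypothetical per-app verdicts from three detectors.
predictions = {
    "app1.apk": ["malware", "malware", "goodware"],
    "app2.apk": ["goodware", "goodware", "goodware"],
}

for apk, verdicts in predictions.items():
    print(apk, "->", majority_vote(verdicts))
```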
Garg, Aayush, in Empirical Software Engineering (2022).

Lee, Jaekwon, in Empirical Software Engineering (2022), 27.
In real-time systems, priorities assigned to real-time tasks determine the order of task executions, by relying on an underlying task scheduling policy. Assigning optimal priority values to tasks is critical to allow the tasks to complete their executions while maximizing safety margins from their specified deadlines. This enables real-time systems to tolerate unexpected overheads in task executions and still meet their deadlines. In practice, priority assignments result from an interactive process between the development and testing teams. In this article, we propose an automated method that aims to identify the best possible priority assignments in real-time systems, accounting for multiple objectives regarding safety margins and engineering constraints. Our approach is based on a multi-objective, competitive coevolutionary algorithm mimicking the interactive priority assignment process between the development and testing teams. We evaluate our approach by applying it to six industrial systems from different domains and several synthetic systems. The results indicate that our approach significantly outperforms both our baselines, i.e., random search and sequential search, and solutions defined by practitioners. Our approach scales to complex industrial systems as an offline analysis method that attempts to find near-optimal solutions within acceptable time, i.e., less than 16 hours.
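The priority-assignment work above searches for assignments that maximize safety margins between task completion times and deadlines. As a simplified illustration of how one candidate assignment might be scored, the sketch below computes worst-case response times with classic fixed-priority response-time analysis and derives per-task safety margins; the task set is invented, and the paper's actual multi-objective coevolutionary search on industrial systems is not reproduced here.

```python
import math

# Minimal sketch: score a priority assignment by its safety margins
# (deadline minus worst-case response time) under fixed-priority scheduling.
# The task set below is invented for illustration.
tasks = {  # name: (worst-case execution time, period, deadline, priority)
    "t1": (1, 5, 5, 3),   # higher number = higher priority
    "t2": (2, 10, 9, 2),
    "t3": (3, 20, 20, 1),
}

def response_time(name):
    c, _, deadline, prio = tasks[name]
    higher = [t for t, (_, _, _, p) in tasks.items() if p > prio]
    r = c
    while True:
        interference = sum(math.ceil(r / tasks[h][1]) * tasks[h][0] for h in higher)
        new_r = c + interference
        if new_r == r:
            return r
        if new_r > deadline:      # deadline miss: this assignment is infeasible
            return None
        r = new_r

margins = {}
for name, (_, _, deadline, _) in tasks.items():
    r = response_time(name)
    margins[name] = None if r is None else deadline - r

print(margins)   # per-task safety margins for this candidate assignment
```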
; Bissyande, Tegawendé François D Assise, in Empirical Software Engineering (2022), 27.

Ojdanic, Milos, in Empirical Software Engineering (2022), 27.
Applying mutation testing to test subtle program changes, such as program patches or other small-scale code modifications, requires using mutants that capture the delta of the altered behaviours. To address this issue, we introduce the concept of commit-relevant mutants, which are the mutants that interact with the behaviours of the system affected by a particular commit. Commit-aware mutation testing is therefore a test assessment metric tailored to a specific commit. By analysing 83 commits from 25 projects involving 2,253,610 mutants in both C and Java, we identify the commit-relevant mutants and explore their relationship with other categories of mutants. Our results show that commit-relevant mutants represent a small subset of all mutants, which differs from the other classes of mutants (subsuming and hard-to-kill), and that the commit-relevant mutation score is weakly correlated with the traditional mutation score (Kendall/Pearson 0.15-0.4). Moreover, commit-aware mutation analysis provides insights about the testing of a commit, which can be more efficient than classical mutation analysis; in our experiments, by analysing the same number of mutants, commit-aware mutants have better fault-revelation potential (30% higher chances of revealing commit-introducing faults) than traditional mutants. We also illustrate a possible application of commit-aware mutation testing as a metric to evaluate test case prioritisation.
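Following the entry above, a commit-relevant mutation score is simply the classical kill ratio restricted to the mutants that interact with the behaviour affected by the commit. The sketch below contrasts the two scores on made-up data; how relevance is actually established (the paper's core contribution) is not reproduced here.

```python
# Minimal sketch: traditional vs commit-relevant mutation score on invented data.
# Each mutant records whether the test suite kills it and whether it is relevant
# to the commit under analysis (relevance is assumed to be known in this example).
mutants = [
    {"id": 1, "killed": True,  "commit_relevant": True},
    {"id": 2, "killed": False, "commit_relevant": True},
    {"id": 3, "killed": True,  "commit_relevant": False},
    {"id": 4, "killed": False, "commit_relevant": False},
    {"id": 5, "killed": True,  "commit_relevant": False},
]

def mutation_score(pool):
    return sum(m["killed"] for m in pool) / len(pool) if pool else 0.0

relevant = [m for m in mutants if m["commit_relevant"]]
print("traditional score:    ", mutation_score(mutants))   # kills over all mutants
print("commit-relevant score:", mutation_score(relevant))  # kills over relevant mutants only
```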
Shin, Donghwan, in Empirical Software Engineering (2022).
Behavioral software models play a key role in many software engineering tasks; unfortunately, these models either are not available during software development or, if available, quickly become outdated as implementations evolve. Model inference techniques have been proposed as a viable solution to extract finite state models from execution logs. However, existing techniques do not scale well when processing the very large logs that are commonly found in practice. In this paper, we address the scalability problem of inferring the model of a component-based system from large system logs, without requiring any extra information. Our model inference technique, called PRINS, follows a divide-and-conquer approach. The idea is to first infer a model of each system component from the corresponding logs; then, the individual component models are merged together, taking into account the flow of events across components as reflected in the logs. We evaluated PRINS in terms of scalability and accuracy, using nine datasets composed of logs extracted from publicly available benchmarks and a personal computer running desktop business applications. The results show that PRINS can process large logs much faster than a publicly available and well-known state-of-the-art tool, without significantly compromising the accuracy of inferred models.
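PRINS first infers one model per component and then stitches the component models together. As a toy illustration of the first step only, the sketch below derives, for each component, the set of transitions between consecutive events in that component's portion of the log; the log content is invented, and the actual PRINS inference and merging algorithms are not reproduced.

```python
# Minimal sketch: per-component "model" inference from a system log, where a model
# is just the set of observed transitions between consecutive events of the same
# component. Log entries are invented for illustration.
from collections import defaultdict

log = [  # (component, event template)
    ("NetMgr", "connect"), ("Storage", "open"), ("NetMgr", "send"),
    ("Storage", "write"), ("NetMgr", "close"), ("Storage", "close"),
]

def infer_component_models(entries):
    last_event = {}                 # last event seen per component
    models = defaultdict(set)       # component -> set of (from_event, to_event)
    for component, event in entries:
        if component in last_event:
            models[component].add((last_event[component], event))
        last_event[component] = event
    return models

for component, transitions in infer_component_models(log).items():
    print(component, sorted(transitions))
```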
; ; et al., in Empirical Software Engineering (2022), 27.

Ul Haq, Fitash, in Empirical Software Engineering (2021), 26(5).
We distinguish two general modes of testing for Deep Neural Networks (DNNs): offline testing, where DNNs are tested as individual units based on test datasets obtained without involving the DNNs under test, and online testing, where DNNs are embedded into a specific application environment and tested in a closed-loop mode in interaction with that environment. Typically, DNNs are subjected to both types of testing during their development life cycle, where offline testing is applied immediately after DNN training and online testing follows after offline testing and once a DNN is deployed within a specific application environment. In this paper, we study the relationship between offline and online testing. Our goal is to determine how offline testing and online testing differ or complement one another and whether offline testing results can be used to help reduce the cost of online testing. Though these questions are generally relevant to all autonomous systems, we study them in the context of automated driving systems where, as study subjects, we use DNNs automating end-to-end control of the steering functions of self-driving vehicles. Our results show that offline testing is less effective than online testing, as many safety violations identified by online testing could not be identified by offline testing, while large prediction errors generated by offline testing always led to severe safety violations detectable by online testing. Further, we cannot exploit offline testing results to reduce the cost of online testing in practice, since we are not able to identify specific situations where offline testing could be as accurate as online testing in identifying safety requirement violations.

Veizaga Campero, Alvaro Mario, in Empirical Software Engineering (2021), 26(4), 79.
[Context] Natural language (NL) is pervasive in software requirements specifications (SRSs). However, despite its popularity and widespread use, NL is highly prone to quality issues such as vagueness, ambiguity, and incompleteness. Controlled natural languages (CNLs) have been proposed as a way to prevent quality problems in requirements documents, while maintaining the flexibility to write and communicate requirements in an intuitive and universally understood manner. [Objective] In collaboration with an industrial partner from the financial domain, we systematically develop and evaluate a CNL, named Rimay, intended to help analysts write functional requirements. [Method] We rely on Grounded Theory for building Rimay and follow well-known guidelines for conducting and reporting industrial case study research. [Results] Our main contributions are: (1) a qualitative methodology to systematically define a CNL for functional requirements; this methodology is intended to be general for use across information-system domains; (2) a CNL grammar to represent functional requirements; this grammar is derived from our experience in the financial domain, but should be applicable, possibly with adaptations, to other information-system domains; and (3) an empirical evaluation of our CNL (Rimay) through an industrial case study. Our contributions draw on 15 representative SRSs, collectively containing 3215 NL requirements statements from the financial domain. [Conclusion] Our evaluation shows that Rimay is expressive enough to capture, on average, 88% (405 out of 460) of the NL requirements statements in four previously unseen SRSs from the financial domain.

Soremekun, Ezekiel, in Empirical Software Engineering (2021), 26(3), 1-45.
Statistical fault localization is an easily deployed technique for quickly determining candidates for faulty code locations. If a human programmer has to search for the fault beyond the top candidate locations, though, more traditional techniques of following dependencies along dynamic slices may be better suited. In a large study of 457 bugs (369 single faults and 88 multiple faults) in 46 open-source C programs, we compare the effectiveness of statistical fault localization against dynamic slicing. For single faults, we find that dynamic slicing was eight percentage points more effective than the best performing statistical debugging formula; for 66% of the bugs, dynamic slicing finds the fault earlier than the best performing statistical debugging formula. In our evaluation, dynamic slicing is more effective for programs with a single fault, but statistical debugging performs better on multiple faults. Best results, however, are obtained by a hybrid approach: if programmers first examine at most the top five most suspicious locations from statistical debugging, and then switch to dynamic slices, on average they will need to examine 15% (30 lines) of the code. These findings hold for the 18 most effective statistical debugging formulas, and our results are independent of the number of faults (i.e., single or multiple faults) and error type (i.e., artificial or real errors).
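The study above finds that a hybrid of statistical debugging and dynamic slicing works best. To make the statistical half concrete, the sketch below ranks statements with the Ochiai formula (one commonly used formula among the many the study considers) from a toy coverage matrix; dynamic slicing is not reproduced, and in the hybrid a programmer would inspect at most the top five ranked locations before switching to the slice.

```python
import math

# Minimal sketch: Ochiai suspiciousness ranking from a toy coverage matrix.
# coverage[test] is the set of statement ids executed by that test; data is invented.
coverage = {
    "t1": {1, 2, 3},     # passing
    "t2": {1, 3, 4},     # passing
    "t3": {1, 2, 4, 5},  # failing
    "t4": {2, 4, 5},     # failing
}
failing = {"t3", "t4"}

def ochiai(stmt):
    ef = sum(1 for t in failing if stmt in coverage[t])                        # failing tests covering stmt
    ep = sum(1 for t in coverage if t not in failing and stmt in coverage[t])  # passing tests covering stmt
    denom = math.sqrt(len(failing) * (ef + ep))
    return ef / denom if denom else 0.0

statements = sorted(set().union(*coverage.values()))
ranking = sorted(statements, key=ochiai, reverse=True)
print([(s, round(ochiai(s), 2)) for s in ranking])   # inspect the top-5 first, then slice
```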
Riom, Timothée, in Empirical Software Engineering (2021), 26.
Detecting vulnerabilities in software is a constant race between development teams and potential attackers. While many static and dynamic approaches have focused on regularly analyzing the software in its entirety, a recent research direction has focused on the analysis of changes that are applied to the code. VCCFinder is a seminal approach in the literature that builds on machine learning to automatically detect whether an incoming commit will introduce some vulnerabilities. Given the influence of VCCFinder in the literature, we undertake an investigation into its performance as a state-of-the-art system. To that end, we propose to attempt a replication study on the VCCFinder supervised learning approach. The insights of our failure to replicate the results reported in the original publication informed the design of a new approach to identify vulnerability-contributing commits based on a semi-supervised learning technique with an alternate feature set. We provide all artefacts and a clear description of this approach as a new reproducible baseline for advancing research on machine learning-based identification of vulnerability-introducing commits.

Sleimi, Amin, in Empirical Software Engineering (2021), 26(3), 43.
Semantic legal metadata provides information that helps with understanding and interpreting legal provisions. Such metadata is therefore important for the systematic analysis of legal requirements. However, manually enhancing a large legal corpus with semantic metadata is prohibitively expensive. Our work is motivated by two observations: (1) the existing requirements engineering (RE) literature does not provide a harmonized view on the semantic metadata types that are useful for legal requirements analysis; (2) automated support for the extraction of semantic legal metadata is scarce, and it does not exploit the full potential of artificial intelligence technologies, notably natural language processing (NLP) and machine learning (ML). Our objective is to take steps toward overcoming these limitations. To do so, we review and reconcile the semantic legal metadata types proposed in the RE literature. Subsequently, we devise an automated extraction approach for the identified metadata types using NLP and ML. We evaluate our approach through two case studies over the Luxembourgish legislation. Our results indicate a high accuracy in the generation of metadata annotations. In particular, in the two case studies, we were able to obtain precision scores of 97.2% and 82.4%, and recall scores of 94.9% and 92.4%.
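The evaluation above reports precision and recall of the generated metadata annotations. For reference, the sketch below shows how such scores are typically computed from a set of predicted annotations and a gold standard; the annotation spans and metadata types are invented and do not come from the paper.

```python
# Minimal sketch: precision/recall of predicted annotations against a gold set.
# Each annotation is ((provision, text span), metadata type); values are invented.
gold = {(("Art. 3", "the controller"), "agent"), (("Art. 3", "shall notify"), "modality")}
predicted = {(("Art. 3", "the controller"), "agent"), (("Art. 3", "notify"), "action")}

true_positives = len(gold & predicted)
precision = true_positives / len(predicted) if predicted else 0.0
recall = true_positives / len(gold) if gold else 0.0
print(f"precision={precision:.2f} recall={recall:.2f}")
```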
Samhi, Jordan, in Empirical Software Engineering (2021).
Due to the convenience of access-on-demand to information and business solutions, mobile apps have become an important asset in the digital world. In the context of the Covid-19 pandemic, app developers have joined the response effort in various ways by releasing apps that target different user bases (e.g., all citizens or journalists), offer different services (e.g., location tracking or diagnostic aid), provide generic or specialized information, etc. While many apps have raised some concerns by spreading misinformation or even malware, the literature does not yet provide a clear landscape of the different apps that were developed. In this study, we focus on the Android ecosystem and investigate Covid-related Android apps. In a best-effort scenario, we attempt to systematically identify all relevant apps and study their characteristics with the objective of providing a first taxonomy of Covid-related apps, broadening the relevance beyond the implementation of contact tracing. Overall, our study yields a number of empirical insights that contribute to enlarging the knowledge on Covid-related apps: (1) developer communities contributed rapidly to the Covid-19 response, with dedicated apps released as early as January 2020; (2) Covid-related apps deliver digital tools to users (e.g., health diaries), serve to broadcast information to users (e.g., spread statistics), and collect data from users (e.g., for tracing); (3) Covid-related apps are less complex than standard apps; (4) they generally do not seem to leak sensitive data; (5) in the majority of cases, Covid-related apps are released by entities with past experience on the market, mostly official government entities or public health organizations.

Daoudi, Nadia, in Empirical Software Engineering (2021), 26.
A well-known curse of computer security research is that it often produces systems that, while technically sound, fail operationally. To overcome this curse, the community generally seeks to assess proposed systems under a variety of settings in order to make explicit every potential bias. In this respect, recently, research achievements on machine learning based malware detection are being considered for thorough evaluation by the community. Such an effort of comprehensive evaluation supposes first and foremost the possibility to perform an independent reproduction study in order to sharpen evaluations presented by approaches' authors. The question "Can published approaches actually be reproduced?" thus becomes paramount despite the little interest such mundane and practical aspects seem to attract in the malware detection field. In this paper, we attempt a complete reproduction of five Android malware detectors from the literature and discuss to what extent they are "reproducible". Notably, we provide insights on the implications around the guesswork that may be required to finalise a working implementation. Finally, we discuss how barriers to reproduction could be lifted, and how the malware detection field would benefit from stronger reproducibility standards, like many other fields already have.

; ; et al., in Empirical Software Engineering (2021), 26(6), 1-33.

Iannillo, Antonio Ken, in Empirical Software Engineering (2020).

Koyuncu, Anil, in Empirical Software Engineering (2020).
Patching is a common activity in software development. It is generally performed on a source code base to address bugs or add new functionalities. In this context, given the recurrence of bugs across projects, the associated similar patches can be leveraged to extract generic fix actions. While the literature includes various approaches leveraging similarity among patches to guide program repair, these approaches often do not yield fix patterns that are tractable and reusable as actionable input to APR systems. In this paper, we propose a systematic and automated approach to mining relevant and actionable fix patterns based on an iterative clustering strategy applied to atomic changes within patches. The goal of FixMiner is thus to infer separate and reusable fix patterns that can be leveraged in other patch generation systems. Our technique, FixMiner, leverages Rich Edit Scripts, a specialized tree structure of the edit scripts that captures the AST-level context of the code changes. FixMiner uses different tree representations of Rich Edit Scripts for each round of clustering to identify similar changes: abstract syntax trees, edit actions trees, and code context trees. We have evaluated FixMiner on thousands of software patches collected from open-source projects. Preliminary results show that we are able to mine accurate patterns, efficiently exploiting change information in Rich Edit Scripts. We further integrated the mined patterns into an automated program repair prototype, PARFixMiner, with which we are able to correctly fix 26 bugs of the Defects4J benchmark. Beyond this quantitative performance, we show that the mined fix patterns are sufficiently relevant to produce patches with a high probability of correctness: 81% of PARFixMiner's generated plausible patches are correct.
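FixMiner clusters patches over several rounds of tree representations of Rich Edit Scripts. The sketch below captures only the spirit of the idea: patches are grouped when their abstracted edit-action sequences are identical. The patches and the abstraction are invented, and none of FixMiner's actual tree representations or iterative clustering is reproduced.

```python
# Minimal sketch: group patches whose abstracted edit-action sequences match.
# Each patch is summarised as a tuple of (action, node type) pairs; a real system
# would derive these from AST diffs (e.g., GumTree-style edit scripts).
from collections import defaultdict

patches = {
    "patch-001": (("UPDATE", "InfixExpression"), ("INSERT", "IfStatement")),
    "patch-002": (("UPDATE", "InfixExpression"), ("INSERT", "IfStatement")),
    "patch-003": (("DELETE", "MethodInvocation"),),
}

clusters = defaultdict(list)
for patch_id, actions in patches.items():
    clusters[actions].append(patch_id)   # identical sequences fall into one group

for actions, members in clusters.items():
    print(actions, "->", members)
```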
; Gao, Jun, in Empirical Software Engineering (2020), 24(118), 1-41.
Because of functionality evolution, or security- and performance-related changes, some APIs eventually become unnecessary in a software system and thus need to be cleaned to ensure proper maintainability. Those APIs are typically marked first as deprecated APIs and, as recommended, follow through a deprecate-replace-remove cycle, giving client application developers an opportunity to smoothly adapt their code in the next updates. Such a mechanism is adopted in the Android framework development, where thousands of reusable APIs are made available to Android app developers. In this work, we present a research-based prototype tool called CDA and apply it to different revisions (i.e., releases or tags) of the Android framework code for characterising deprecated APIs. Based on the data mined by CDA, we then perform an empirical study on API deprecation in the Android ecosystem and the associated challenges for maintaining quality apps. In particular, we investigate the prevalence of deprecated APIs, their annotations and documentation, their removal and consequences, their replacement messages, developer reactions to API deprecation, as well as the evolution of the usage of deprecated APIs. Experimental results reveal several findings that further provide promising insights related to deprecated Android APIs. Notably, by mining the source code of the Android framework base, we have identified three bugs related to deprecated APIs. These bugs have been quickly assigned and positively appreciated by the framework maintainers, who claim that these issues will be updated in future releases.
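The deprecation study above follows APIs across framework revisions. As a toy illustration of the kind of bookkeeping involved, the sketch below compares the API status maps of two hypothetical releases to list APIs that were deprecated in the first and removed by the second, and flags an app's usages of them; all API names and statuses are invented.

```python
# Minimal sketch: track deprecated-then-removed APIs across two framework releases
# and flag an app's affected usages. All names and statuses are invented.
release_a = {"android.app.Foo.bar()": "deprecated", "android.net.Baz.qux()": "active"}
release_b = {"android.net.Baz.qux()": "active"}          # Foo.bar() no longer present

deprecated_then_removed = {
    api for api, status in release_a.items()
    if status == "deprecated" and api not in release_b
}

app_calls = {"android.app.Foo.bar()", "android.net.Baz.qux()"}
affected = app_calls & deprecated_then_removed
print("APIs the app must migrate away from:", affected)
```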