References of "Papadakis, Mike 50002811"
Peer Reviewed
Selecting Fault Revealing Mutants
Titcheu Chekam, Thierry UL; Papadakis, Mike UL; Bissyande, Tegawendé François D Assise UL et al

in Empirical Software Engineering (in press)

Peer Reviewed
CODEBERT-NT: code naturalness via CodeBERT
Khanfir, Ahmed UL; Jimenez, Matthieu UL; Papadakis, Mike UL et al

in 22nd IEEE International Conference on Software Quality, Reliability and Security (QRS'22) (2022, December 05)

Much recent software-engineering research has investigated the naturalness of code, the fact that code, in small code snippets, is repetitive and can be predicted using statistical language models such as n-gram models. Although such models are powerful, training them on a large code corpus can be tedious, time-consuming and sensitive to code patterns (and practices) encountered during training. Consequently, these models are often trained on a small corpus and thus only estimate the language naturalness relative to a specific style of programming or type of project. To overcome these issues, we investigate the use of pre-trained generative language models to infer code naturalness. Pre-trained models are often built on big data, are easy to use out of the box and include powerful associative learning mechanisms. Our key idea is to quantify code naturalness through its predictability, by using state-of-the-art generative pre-trained language models. Thus, we suggest inferring naturalness by masking (omitting) the code tokens of code sequences, one at a time, and checking the models' ability to predict them. We explore three different predictability metrics: a) measuring the number of exact matches of the predictions, b) computing the embedding similarity between the original and predicted code, i.e., similarity in the vector space, and c) computing the confidence of the model when doing the token completion task regardless of the outcome. We implement this workflow, named CODEBERT-NT, and evaluate its capability to prioritize buggy lines over non-buggy ones when ranking code based on its naturalness. Our results, on 2,510 buggy versions of 40 projects from the SmartShark dataset, show that CODEBERT-NT outperforms both random-uniform and complexity-based ranking techniques, and yields comparable results to the n-gram models.

Peer Reviewed
On the use of commit-relevant mutants
Ojdanic, Milos UL; Ma, Wei; Laurent, Thomas et al

in Empirical Software Engineering (2022), 27

Applying mutation testing to test subtle program changes, such as program patches or other small-scale code modifications, requires using mutants that capture the delta of the altered behaviours. To address this issue, we introduce the concept of commit-relevant mutants, which are the mutants that interact with the behaviours of the system affected by a particular commit. Commit-aware mutation testing is therefore a test assessment metric tailored to a specific commit. By analysing 83 commits from 25 projects involving 2,253,610 mutants in both C and Java, we identify the commit-relevant mutants and explore their relationship with other categories of mutants. Our results show that commit-relevant mutants represent a small subset of all mutants, which differs from the other classes of mutants (subsuming and hard-to-kill), and that the commit-relevant mutation score is weakly correlated with the traditional mutation score (Kendall/Pearson 0.15-0.4). Moreover, commit-aware mutation analysis provides insights about the testing of a commit, which can be more efficient than the classical mutation analysis; in our experiments, by analysing the same number of mutants, commit-aware mutants have better fault-revelation potential (30% higher chances of revealing commit-introducing faults) than traditional mutants. We also illustrate a possible application of commit-aware mutation testing as a metric to evaluate test case prioritisation.

Peer Reviewed
GraphCode2Vec: generic code embedding via lexical and program dependence analyses
Ma, Wei UL; Zhao, Mengjie; Soremekun, Ezekiel UL et al

in Proceedings of the 19th International Conference on Mining Software Repositories (2022, May 22)

Code embedding is a keystone in the application of machine learning to several Software Engineering (SE) tasks. To effectively support a plethora of SE tasks, the embedding needs to capture program syntax and semantics in a way that is generic. To this end, we propose the first self-supervised pre-training approach (called GraphCode2Vec) which produces task-agnostic embedding of lexical and program dependence features. GraphCode2Vec achieves this via a synergistic combination of code analysis and Graph Neural Networks. GraphCode2Vec is generic, it allows pre-training, and it is applicable to several SE downstream tasks. We evaluate the effectiveness of GraphCode2Vec on four (4) tasks (method name prediction, solution classification, mutation testing and overfitted patch classification), and compare it with four (4) similarly generic code embedding baselines (Code2Seq, Code2Vec, CodeBERT, GraphCodeBERT) and seven (7) task-specific, learning-based methods. In particular, GraphCode2Vec is more effective than both generic and task-specific learning-based baselines. It is also complementary and comparable to GraphCodeBERT (a larger and more complex model). We also demonstrate through a probing and ablation study that GraphCode2Vec learns lexical and program dependence features and that self-supervised pre-training improves effectiveness.

Peer Reviewed
iBiR: Bug Report driven Fault Injection
Khanfir, Ahmed UL; Koyuncu, Anil; Papadakis, Mike UL et al

in ACM Transactions on Software Engineering and Methodology (2022)

Peer Reviewed
Mutation Testing in Evolving Systems: Studying the relevance of mutants to code evolution
Ojdanic, Milos UL; Soremekun, Ezekiel UL; Degiovanni, Renzo Gaston UL et al

in ACM Transactions on Software Engineering and Methodology (2022)

Context: When software evolves, opportunities for introducing faults appear. Therefore, it is important to test the evolved program behaviors during each evolution cycle. However, while software evolves, its complexity is also evolving, introducing challenges to the testing process. To deal with this issue, testing techniques should be adapted to target the effect of the program changes instead of the entire program functionality. To this end, commit-aware mutation testing, a powerful testing technique, has been proposed. Unfortunately, commit-aware mutation testing is challenging due to the complex program semantics involved. Hence, it is pertinent to understand the characteristics, predictability, and potential of the technique. Objective: We conduct an exploratory study to investigate the properties of commit-relevant mutants, i.e., the test elements of commit-aware mutation testing, by proposing a general definition and an experimental approach to identify them. We thus aim at investigating the prevalence, location, and comparative advantages of commit-aware mutation testing over time (i.e., the program evolution). We also investigate the predictive power of several commit-related features in identifying and selecting commit-relevant mutants to understand the essential properties for its best-effort application case. Method: Our commit-relevant definition relies on the notion of observational slicing, approximated by higher-order mutation. Specifically, our approach utilizes the impact of mutants, i.e., the effects of one mutant on another, to capture and analyze the implicit interactions between the changed and unchanged code parts. The study analyses millions of mutants (over 10 million), 288 commits, and five (5) different open-source software projects, involving over 68,213 CPU days of computation, and sets a ground truth on which we perform our analysis.
Results: Our analysis shows that commit-relevant mutants are located mainly outside of program commit change (81%), suggesting a limitation in previous work. We also note that effective selection of commit-relevant mutants has the potential of reducing the number of mutants by up to 93%. In addition, we demonstrate that commit-relevant mutation testing is significantly more effective and efficient than state-of-the-art baselines, i.e., random mutant selection and analysis of only mutants within the program change. In our analysis of the predictive power of mutants and commit-related features (e.g., number of mutants within a change, mutant type, and commit size) in predicting commit-relevant mutants, we found that most proxy features do not reliably predict commit-relevant mutants. Conclusion: This empirical study highlights the properties of commit-relevant mutants and demonstrates the importance of identifying and selecting commit-relevant mutants when testing evolving software systems.

Towards Generalizable Machine Learning for Chest X-ray Diagnosis with Multi-task learning
Ghamizi, Salah UL; Garcia Santa Cruz, Beatriz UL; Temple, Paul et al

E-print/Working paper (2022)

Clinicians use chest radiography (CXR) to diagnose common pathologies. Automated classification of these diseases can expedite analysis workflow, scale to growing numbers of patients and reduce healthcare costs. While research has produced classification models that perform well on a given dataset, the same models lack generalization on different datasets. This reduces confidence that these models can be reliably deployed across various clinical settings. We propose an approach based on multitask learning to improve model generalization. We demonstrate that learning a (main) pathology together with an auxiliary pathology can significantly impact generalization performance (between -10% and +15% AUC-ROC). A careful choice of auxiliary pathology even yields competitive performance with state-of-the-art models that rely on fine-tuning or ensemble learning, using between 6% and 34% of the training data that these models required. We further provide a method to determine the best auxiliary task to choose without access to the target dataset. Ultimately, our work makes a big step towards the creation of CXR diagnosis models applicable in the real world, through the evidence that multitask learning can drastically improve generalization.

Peer Reviewed
Cerebro: Static Subsuming Mutant Selection
Garg, Aayush UL; Ojdanic, Milos UL; Degiovanni, Renzo Gaston UL et al

in IEEE Transactions on Software Engineering (2022)

Peer Reviewed
Efficient and Transferable Adversarial Examples from Bayesian Neural Networks
Gubri, Martin UL; Cordy, Maxime UL; Papadakis, Mike UL et al

in The 38th Conference on Uncertainty in Artificial Intelligence (2022)

An established way to improve the transferability of black-box evasion attacks is to craft the adversarial examples on an ensemble-based surrogate to increase diversity. We argue that transferability is fundamentally related to uncertainty. Based on a state-of-the-art Bayesian Deep Learning technique, we propose a new method to efficiently build a surrogate by sampling approximately from the posterior distribution of neural network weights, which represents the belief about the value of each parameter. Our extensive experiments on ImageNet, CIFAR-10 and MNIST show that our approach improves the success rates of four state-of-the-art attacks significantly (up to 83.2 percentage points), in both intra-architecture and inter-architecture transferability. On ImageNet, our approach can reach a 94% success rate while reducing training computations from 11.6 to 2.4 exaflops, compared to an ensemble of independently trained DNNs. Our vanilla surrogate achieves higher transferability 87.5% of the time than three test-time techniques designed for this purpose. Our work demonstrates that the way to train a surrogate has been overlooked, although it is an important element of transfer-based attacks. We are, therefore, the first to review the effectiveness of several training methods in increasing transferability. We provide new directions to better understand the transferability phenomenon and offer a simple but strong baseline for future work.

Peer Reviewed
LGV: Boosting Adversarial Example Transferability from Large Geometric Vicinity
Gubri, Martin UL; Cordy, Maxime UL; Papadakis, Mike UL et al

in Computer Vision -- ECCV 2022 (2022)

We propose transferability from Large Geometric Vicinity (LGV), a new technique to increase the transferability of black-box adversarial attacks. LGV starts from a pretrained surrogate model and collects multiple weight sets from a few additional training epochs with a constant and high learning rate. LGV exploits two geometric properties that we relate to transferability. First, models that belong to a wider weight optimum are better surrogates. Second, we identify a subspace able to generate an effective surrogate ensemble among this wider optimum. Through extensive experiments, we show that LGV alone outperforms all (combinations of) four established test-time transformations by 1.8 to 59.9 percentage points. Our findings shed new light on the importance of the geometry of the weight space to explain the transferability of adversarial examples.

Peer Reviewed
On Evaluating Adversarial Robustness of Chest X-ray Classification: Pitfalls and Best Practices
Ghamizi, Salah UL; Cordy, Maxime UL; Papadakis, Mike UL et al

in The Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI-23) - SafeAI Workshop, Washington, D.C., Feb 13-14, 2023 (2022)

Peer Reviewed
µBert: Mutation Testing using Pre-Trained Language Models
Degiovanni, Renzo Gaston UL; Papadakis, Mike UL

in Degiovanni, Renzo Gaston; Papadakis, Mike (Eds.) µBert: Mutation Testing using Pre-Trained Language Models (2022)

We introduce µBert, a mutation testing tool that uses a pre-trained language model (CodeBERT) to generate mutants. This is done by masking a token from the expression given as input and using CodeBERT to predict it. Thus, the mutants are generated by replacing the masked tokens with the predicted ones. We evaluate µBert on 40 real faults from Defects4J and show that it can detect 27 out of the 40 faults, while the baseline (PiTest) detects 26 of them. We also show that µBert can be 2 times more cost-effective than PiTest, when the same number of mutants are analysed. Additionally, we evaluate the impact of µBert's mutants when used by program assertion inference techniques, and show that they can help in producing better specifications. Finally, we discuss the quality and naturalness of some interesting mutants produced by µBert during our experimental evaluation.

Peer Reviewed
An Empirical Study on Data Distribution-Aware Test Selection for Deep Learning Enhancement
Hu, Qiang UL; Guo, Yuejun UL; Cordy, Maxime UL et al

in ACM Transactions on Software Engineering and Methodology (2022)

Similar to traditional software that is constantly under evolution, deep neural networks (DNNs) need to evolve upon the rapid growth of test data for continuous enhancement, e.g., adapting to distribution shift in a new environment for deployment. However, it is labor-intensive to manually label all the collected test data. Test selection solves this problem by strategically choosing a small set to label. Via retraining with the selected set, DNNs will achieve competitive accuracy. Unfortunately, existing selection metrics involve three main limitations: 1) using different retraining processes; 2) ignoring data distribution shifts; 3) being insufficiently evaluated. To fill this gap, we first conduct a systematic empirical study to reveal the impact of the retraining process and data distribution on model enhancement. Then, based on our findings, we propose a novel distribution-aware test (DAT) selection metric. Experimental results reveal that retraining using both the training and selected data outperforms using only the selected data. None of the selection metrics performs the best under various data distributions. By contrast, DAT effectively alleviates the impact of distribution shifts and outperforms the compared metrics by up to 5 times and 30.09% accuracy improvement for model enhancement on simulated and in-the-wild distribution shift scenarios, respectively.

Peer Reviewed
Adversarial Robustness in Multi-Task Learning: Promises and Illusions
Ghamizi, Salah UL; Cordy, Maxime UL; Papadakis, Mike UL et al

in Proceedings of the thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22) (2022)

Vulnerability to adversarial attacks is a well-known weakness of Deep Neural Networks. While most of the studies focus on single-task neural networks with computer vision datasets, very little research has considered complex multi-task models that are common in real applications. In this paper, we evaluate the design choices that impact the robustness of multi-task deep learning networks. We provide evidence that blindly adding auxiliary tasks or weighing the tasks provides a false sense of robustness. Thereby, we tone down the claim made by previous research and study the different factors which may affect robustness. In particular, we show that the choice of the tasks to incorporate in the loss function is an important factor that can be leveraged to yield more robust models.

Peer Reviewed
Evasion Attack STeganography: Turning Vulnerability Of Machine Learning To Adversarial Attacks Into A Real-world Application
Ghamizi, Salah UL; Cordy, Maxime UL; Papadakis, Mike UL et al

in Proceedings of International Conference on Computer Vision 2021 (2021)

Evasion Attacks have been commonly seen as a weakness of Deep Neural Networks. In this paper, we flip the paradigm and envision this vulnerability as a useful application. We propose EAST, a new steganography and watermarking technique based on multi-label targeted evasion attacks. Our results confirm that our embedding is elusive; it passes unnoticed by humans, steganalysis methods, and machine-learning detectors. In addition, our embedding is resilient to soft and aggressive image tampering (87% recovery rate under JPEG compression). EAST outperforms existing deep-learning-based steganography approaches with images that are 70% denser and 73% more robust and supports multiple datasets and architectures.

Peer Reviewed
Requirements And Threat Models of Adversarial Attacks and Robustness of Chest X-ray classification
Ghamizi, Salah UL; Cordy, Maxime UL; Papadakis, Mike UL et al

E-print/Working paper (2021)

Vulnerability to adversarial attacks is a well-known weakness of Deep Neural Networks. While most of the studies focus on natural images with standardized benchmarks like ImageNet and CIFAR, little research has considered real-world applications, in particular in the medical domain. Our research shows that, contrary to previous claims, robustness of chest x-ray classification is much harder to evaluate and leads to very different assessments based on the dataset, the architecture and robustness metric. We argue that previous studies did not take into account the peculiarity of medical diagnosis, like the co-occurrence of diseases, the disagreement of labellers (domain experts), the threat model of the attacks and the risk implications for each successful attack. In this paper, we discuss the methodological foundations, review the pitfalls and best practices, and suggest new methodological considerations for evaluating the robustness of chest x-ray classification models. Our evaluation on 3 datasets, 7 models, and 18 diseases is the largest evaluation of robustness of chest x-ray classification models. We believe our findings will provide reliable guidelines for realistic evaluation and improvement of the robustness of machine learning models for medical diagnosis.

Peer Reviewed
A Replication Study on the Usability of Code Vocabulary in Predicting Flaky Tests
Haben, Guillaume UL; Habchi, Sarra UL; Papadakis, Mike UL et al

in 18th International Conference on Mining Software Repositories (2021, May)

Industrial reports indicate that flaky tests are one of the primary concerns of software testing, mainly due to the false signals they provide. To deal with this issue, researchers have developed tools and techniques aiming at (automatically) identifying flaky tests with encouraging results. However, to reach industrial adoption and practice, these techniques need to be replicated and evaluated extensively on multiple datasets, occasions and settings. In view of this, we perform a replication study of a recently proposed method that predicts flaky tests based on their vocabulary. We thus replicate the original study on three different dimensions. First, we replicate the approach on the same subjects as in the original study but using a different evaluation methodology, i.e., we adopt a time-sensitive selection of training and test sets to better reflect the envisioned use case. Second, we consolidate the findings of the initial study by building a new dataset of 837 flaky tests from 9 projects in a different programming language, i.e., Python while the original study was in Java, which supports the generalisability of the results. Third, we propose an extension to the original approach by experimenting with different features extracted from the Code Under Test. Our results demonstrate that a more robust validation has a consistent negative impact on the reported results of the original study, but, fortunately, these do not invalidate the key conclusions of the study. We also find reassuring results that the vocabulary-based models can also be used to predict test flakiness in Python and that the information lying in the Code Under Test has a limited impact on the performance of the vocabulary-based models.

Peer Reviewed
CONFUZZION: A Java Virtual Machine Fuzzer for Type Confusion Vulnerabilities
Bonnaventure, William; Khanfir, Ahmed UL; Bartel, Alexandre et al

in IEEE International Conference on Software Quality, Reliability, and Security (QRS), 2021 (2021)

Peer Reviewed
Statistical model checking for variability-intensive systems: applications to bug detection and minimization
Cordy, Maxime UL; Lazreg, Sami UL; Papadakis, Mike UL et al

in Formal Aspects of Computing (2021), 33(6), 1147-1172
