References of "Jimenez, Matthieu 50002043"
     in
Bookmark and Share    
Full Text
Peer Reviewed
See detailCODEBERT-NT: code naturalness via CodeBERT
Khanfir, Ahmed UL; Jimenez, Matthieu UL; Papadakis, Mike UL et al

in 22nd IEEE International Conference on Software Quality, Reliability and Security (QRS'22) (2022, December 05)

Much of recent software-engineering research has investigated the naturalness of code, the fact that code, in small code snippets, is repetitive and can be predicted using statistical language models like ... [more ▼]

Much of recent software-engineering research has investigated the naturalness of code, the fact that code, in small code snippets, is repetitive and can be predicted using statistical language models like n-gram. Although powerful, training such models on large code corpus can be tedious, time consuming and sensitive to code patterns (and practices) encountered during training. Consequently, these models are often trained on a small corpus and thus only estimate the language naturalness relative to a specific style of programming or type of project. To overcome these issues, we investigate the use of pre-trained generative language models to infer code naturalness. Pre-trained models are often built on big data, are easy to use in an out-of-the-box way and include powerful learning associations mechanisms. Our key idea is to quantify code naturalness through its predictability, by using state-of-the-art generative pre-trained language models. Thus, we suggest to infer naturalness by masking (omitting) code tokens, one at a time, of code-sequences, and checking the models’ability to predict them. We explore three different predictability metrics; a) measuring the number of exact matches of the predictions, b) computing the embedding similarity between the original and predicted code, i.e., similarity at the vector space, and c) computing the confidence of the model when doing the token completion task regardless of the outcome. We implement this workflow, named CODEBERT-NT, and evaluate its capability to prioritize buggy lines over non-buggy ones when ranking code based on its naturalness. Our results, on 2,510 buggy versions of 40 projects from the SmartShark dataset, show that CODEBERT-NT outperforms both, random-uniform and complexity-based ranking techniques, and yields comparable results to the n-gram models. [less ▲]

Detailed reference viewed: 76 (15 UL)
Full Text
Peer Reviewed
See detailFaster and Cheaper Energy Demand Forecasting at Scale
Bernier, Fabien UL; Jimenez, Matthieu UL; Cordy, Maxime UL et al

in Has it Trained Yet? Workshop at the Conference on Neural Information Processing Systems (2022, December 02)

Energy demand forecasting is one of the most challenging tasks for grids operators. Many approaches have been suggested over the years to tackle it. Yet, those still remain too expensive to train in terms ... [more ▼]

Energy demand forecasting is one of the most challenging tasks for grids operators. Many approaches have been suggested over the years to tackle it. Yet, those still remain too expensive to train in terms of both time and computational resources, hindering their adoption as customers behaviors are continuously evolving. We introduce Transplit, a new lightweight transformer-based model, which significantly decreases this cost by exploiting the seasonality property and learning typical days of power demand. We show that Transplit can be run efficiently on CPU and is several hundred times faster than state-of-the-art predictive models, while performing as well. [less ▲]

Detailed reference viewed: 122 (10 UL)
Full Text
Peer Reviewed
See detailAre mutants really natural? A study on how “naturalness” helps mutant selection
Jimenez, Matthieu UL; Titcheu Chekam, Thierry UL; Cordy, Maxime UL et al

in Proceedings of 12th International Symposium on 
 Empirical Software Engineering and Measurement (ESEM'18) (2018, October 11)

Background: Code is repetitive and predictable in a way that is similar to the natural language. This means that code is ``natural'' and this ``naturalness'' can be captured by natural language modelling ... [more ▼]

Background: Code is repetitive and predictable in a way that is similar to the natural language. This means that code is ``natural'' and this ``naturalness'' can be captured by natural language modelling techniques. Such models promise to capture the program semantics and identify source code parts that `smell', i.e., they are strange, badly written and are generally error-prone (likely to be defective). Aims: We investigate the use of natural language modelling techniques in mutation testing (a testing technique that uses artificial faults). We thus, seek to identify how well artificial faults simulate real ones and ultimately understand how natural the artificial faults can be. %We investigate this question in a fault revelation perspective. Our intuition is that natural mutants, i.e., mutants that are predictable (follow the implicit coding norms of developers), are semantically useful and generally valuable (to testers). We also expect that mutants located on unnatural code locations (which are generally linked with error-proneness) to be of higher value than those located on natural code locations. Method: Based on this idea, we propose mutant selection strategies that rank mutants according to a) their naturalness (naturalness of the mutated code), b) the naturalness of their locations (naturalness of the original program statements) and c) their impact on the naturalness of the code that they apply to (naturalness differences between original and mutated statements). We empirically evaluate these issues on a benchmark set of 5 open-source projects, involving more than 100k mutants and 230 real faults. Based on the fault set we estimate the utility (i.e. capability to reveal faults) of mutants selected on the basis of their naturalness, and compare it against the utility of randomly selected mutants. Results: Our analysis shows that there is no link between naturalness and the fault revelation utility of mutants. We also demonstrate that the naturalness-based mutant selection performs similar (slightly worse) to the random mutant selection. Conclusions: Our findings are negative but we consider them interesting as they confute a strong intuition, i.e., fault revelation is independent of the mutants' naturalness. [less ▲]

Detailed reference viewed: 201 (23 UL)
Full Text
See detailEvaluating Vulnerability Prediction Models
Jimenez, Matthieu UL

Doctoral thesis (2018)

Today almost every device depends on a piece of software. As a result, our life increasingly depends on some software form such as smartphone apps, laundry machines, web applications, computers ... [more ▼]

Today almost every device depends on a piece of software. As a result, our life increasingly depends on some software form such as smartphone apps, laundry machines, web applications, computers, transportation and many others, all of which rely on software. Inevitably, this dependence raises the issue of software vulnerabilities and their possible impact on our lifestyle. Over the years, researchers and industrialists suggested several approaches to detect such issues and vulnerabilities. A particular popular branch of such approaches, usually called Vulnerability Prediction Modelling (VPM) techniques, leverage prediction modelling techniques that flag suspicious (likely vulnerable) code components. These techniques rely on source code features as indicators of vulnerabilities to build the prediction models. However, the emerging question is how effective such methods are and how they can be used in practice. The present dissertation studies vulnerability prediction models and evaluates them on real and reliable playground. To this end, it suggests a toolset that automatically collects real vulnerable code instances, from major open source systems, suitable for applying VPM. These code instances are then used to analyze, replicate, compare and develop new VPMs. Specifically, the dissertation has 3 main axes: The first regards the analysis of vulnerabilities. Indeed, to build VPMs accurately, numerous data are required. However, by their nature, vulnerabilities are scarce and the information about them is spread over different sources (NVD, Git, Bug Trackers). Thus, the suggested toolset (develops an automatic way to build a large dataset) enables the reliable and relevant analysis of VPMs. The second axis focuses on the empirical comparison and analysis of existing Vulnerability Prediction Models. It thus develops and replicates existing VPMs. To this end, the thesis introduces a framework that builds, analyse and compares existing prediction models (using the already proposed sets of features) using the dataset developed on the first axis. The third axis explores the use of cross-entropy (metric used by natural language processing) as a potential feature for developing new VPMs. Cross-entropy, usually referred to as the naturalness of code, is a recent approach that measures the repetitiveness of code (relying on statistical models). Using cross-entropy, the thesis investigates different ways of building and using VPMs. Overall, this thesis provides a fully-fledge study on Vulnerability Prediction Models aiming at assessing and improving their performance. [less ▲]

Detailed reference viewed: 560 (56 UL)
Full Text
Peer Reviewed
See detailTUNA: TUning Naturalness-based Analysis
Jimenez, Matthieu UL; Cordy, Maxime UL; Le Traon, Yves UL et al

in 34th IEEE International Conference on Software Maintenance and Evolution, Madrid, Spain, 26-28 September 2018 (2018, September 26)

Natural language processing techniques, in particular n-gram models, have been applied successfully to facilitate a number of software engineering tasks. However, in our related ICSME ’18 paper, we have ... [more ▼]

Natural language processing techniques, in particular n-gram models, have been applied successfully to facilitate a number of software engineering tasks. However, in our related ICSME ’18 paper, we have shown that the conclusions of a study can drastically change with respect to how the code is tokenized and how the used n-gram model is parameterized. These choices are thus of utmost importance, and one must carefully make them. To show this and allow the community to benefit from our work, we have developed TUNA (TUning Naturalness-based Analysis), a Java software artifact to perform naturalness-based analyses of source code. To the best of our knowledge, TUNA is the first open- source, end-to-end toolchain to carry out source code analyses based on naturalness. [less ▲]

Detailed reference viewed: 212 (12 UL)
Full Text
Peer Reviewed
See detailOn the impact of tokenizer and parameters on N-gram based Code Analysis
Jimenez, Matthieu UL; Cordy, Maxime UL; Le Traon, Yves UL et al

Scientific Conference (2018, September)

Recent research shows that language models, such as n-gram models, are useful at a wide variety of software engineering tasks, e.g., code completion, bug identification, code summarisation, etc. However ... [more ▼]

Recent research shows that language models, such as n-gram models, are useful at a wide variety of software engineering tasks, e.g., code completion, bug identification, code summarisation, etc. However, such models require the appropriate set of numerous parameters. Moreover, the different ways one can read code essentially yield different models (based on the different sequences of tokens). In this paper, we focus on n- gram models and evaluate how the use of tokenizers, smoothing, unknown threshold and n values impact the predicting ability of these models. Thus, we compare the use of multiple tokenizers and sets of different parameters (smoothing, unknown threshold and n values) with the aim of identifying the most appropriate combinations. Our results show that the Modified Kneser-Ney smoothing technique performs best, while n values are depended on the choice of the tokenizer, with values 4 or 5 offering a good trade-off between entropy and computation time. Interestingly, we find that tokenizers treating the code as simple text are the most robust ones. Finally, we demonstrate that the differences between the tokenizers are of practical importance and have the potential of changing the conclusions of a given experiment. [less ▲]

Detailed reference viewed: 254 (16 UL)
Full Text
Peer Reviewed
See detailEnabling the Continous Analysis of Security Vulnerabilities with VulData7
Jimenez, Matthieu UL; Le Traon, Yves UL; Papadakis, Mike UL

in IEEE International Working Conference on Source Code Analysis and Manipulation (2018)

Detailed reference viewed: 347 (37 UL)
Full Text
Peer Reviewed
See detailAnalyzing Complex Data in Motion at Scale with Temporal Graphs
Hartmann, Thomas UL; Fouquet, François UL; Jimenez, Matthieu UL et al

in Proceedings of the 29th International Conference on Software Engineering and Knowledge Engineering (2017, July)

Modern analytics solutions succeed to understand and predict phenomenons in a large diversity of software systems, from social networks to Internet-of-Things platforms. This success challenges analytics ... [more ▼]

Modern analytics solutions succeed to understand and predict phenomenons in a large diversity of software systems, from social networks to Internet-of-Things platforms. This success challenges analytics algorithms to deal with more and more complex data, which can be structured as graphs and evolve over time. However, the underlying data storage systems that support large-scale data analytics, such as time-series or graph databases, fail to accommodate both dimensions, which limits the integration of more advanced analysis taking into account the history of complex graphs, for example. This paper therefore introduces a formal and practical definition of temporal graphs. Temporal graphs pro- vide a compact representation of time-evolving graphs that can be used to analyze complex data in motion. In particular, we demonstrate with our open-source implementation, named GREYCAT, that the performance of temporal graphs allows analytics solutions to deal with rapidly evolving large-scale graphs. [less ▲]

Detailed reference viewed: 313 (19 UL)
Full Text
See detailOn the Naturalness of Mutants
Jimenez, Matthieu UL; Cordy, Maxime UL; Kintis, Marinos UL et al

E-print/Working paper (2017)

Detailed reference viewed: 197 (15 UL)
Full Text
Peer Reviewed
See detailAn Empirical Analysis of Vulnerabilities in OpenSSL and the Linux Kernel
Jimenez, Matthieu UL; Papadakis, Mike UL; Le Traon, Yves UL

in 2016 Asia-Pacific Software Engineering Conference (APSEC) (2016, December)

Vulnerabilities are one of the main concerns faced by practitioners when working with security critical applications. Unfortunately, developers and security teams, even experienced ones, fail to identify ... [more ▼]

Vulnerabilities are one of the main concerns faced by practitioners when working with security critical applications. Unfortunately, developers and security teams, even experienced ones, fail to identify many of them with severe consequences. Vulnerabilities are hard to discover since they appear in various forms, caused by many different issues and their identification requires an attacker’s mindset. In this paper, we aim at increasing the understanding of vulnerabilities by investigating their characteristics on two major open-source software systems, i.e., the Linux kernel and OpenSSL. In particular, we seek to analyse and build a profile for vulnerable code, which can ultimately help researchers in building automated approaches like vulnerability prediction models. Thus, we examine the location, criticality and category of vulnerable code along with its relation with software metrics. To do so, we collect more than 2,200 vulnerable files accounting for 863 vulnerabilities and compute more than 35 software metrics. Our results indicate that while 9 Common Weakness Enumeration (CWE) types of vulnerabilities are prevalent, only 3 of them are critical in OpenSSL and 2 of them in the Linux kernel. They also indicate that different types of vulnerabilities have different characteristics, i.e., metric profiles, and that vulnerabilities of the same type have different profiles in the two projects we examined. We also found that the file structure of the projects can provide useful information related to the vulnerabilities. Overall, our results demonstrate the need for making project specific approaches that focus on specific types of vulnerabilities. [less ▲]

Detailed reference viewed: 333 (17 UL)
Full Text
Peer Reviewed
See detailVulnerability Prediction Models: A case study on the Linux Kernel
Jimenez, Matthieu UL; Papadakis, Mike UL; Le Traon, Yves UL

in 16th IEEE International Working Conference on Source Code Analysis and Manipulation, SCAM 2016, Raleigh, US, October 2-3, 2016 (2016, October)

To assist the vulnerability identification process, researchers proposed prediction models that highlight (for inspection) the most likely to be vulnerable parts of a system. In this paper we aim at ... [more ▼]

To assist the vulnerability identification process, researchers proposed prediction models that highlight (for inspection) the most likely to be vulnerable parts of a system. In this paper we aim at making a reliable replication and comparison of the main vulnerability prediction models. Thus, we seek for determining their effectiveness, i.e., their ability to distinguish between vulnerable and non-vulnerable components, in the context of the Linux Kernel, under different scenarios. To achieve the above-mentioned aims, we mined vulnerabilities reported in the National Vulnerability Database and created a large dataset with all vulnerable components of Linux from 2005 to 2016. Based on this, we then built and evaluated the prediction models. We observe that an approach based on the header files included and on function calls performs best when aiming at future vulnerabilities, while text mining is the best technique when aiming at random instances. We also found that models based on code metrics perform poorly. We show that in the context of the Linux kernel, vulnerability prediction models can be superior to random selection and relatively precise. Thus, we conclude that practitioners have a valuable tool for prioritizing their security inspection efforts. [less ▲]

Detailed reference viewed: 473 (32 UL)
Full Text
Peer Reviewed
See detailProfiling Android Vulnerabilities
Jimenez, Matthieu UL; Papadakis, Mike UL; Bissyande, Tegawendé François D Assise UL et al

in 2016 IEEE International Conference on Software Quality, Reliability and Security (QRS 2016) (2016, August)

In widely used mobile operating systems a single vulnerability can threaten the security and privacy of billions of users. Therefore, identifying vulnerabilities and fortifying software systems requires ... [more ▼]

In widely used mobile operating systems a single vulnerability can threaten the security and privacy of billions of users. Therefore, identifying vulnerabilities and fortifying software systems requires constant attention and effort. However, this is costly and it is almost impossible to analyse an entire code base. Thus, it is necessary to prioritize efforts towards the most likely vulnerable areas. A first step in identifying these areas is to profile vulnerabilities based on previously reported ones. To investigate this, we performed a manual analysis of Android vulnerabilities, as reported in the National Vulnerability Database for the period 2008 to 2014. In our analysis, we identified a comprehensive list of issues leading to Android vulnerabilities. We also point out characteristics of the locations where vulnerabilities reside, the complexity of these locations and the complexity to fix the vulnerabilities. To enable future research, we make available all of our data. [less ▲]

Detailed reference viewed: 374 (31 UL)