Abstract:
[en] Software vulnerabilities pose significant risks to the security and reliability of computer systems, potentially leading to data breaches, service disruptions, and financial losses. The timely and accurate prediction and repair of vulnerabilities is therefore paramount. Over the last few years, automatic program repair (APR) has made major progress in patching general bugs, but much less in patching security-related bugs. The main reason is that most program repair techniques rely on test cases as a key ingredient for driving patch generation and validation, whereas for vulnerabilities the corresponding test cases, known in practice as exploits, are rarely available. The scarcity of software vulnerability fixes is a further barrier to vulnerability repair. In this dissertation, we examine the prediction and the automatic repair of vulnerable code using artificial neural networks.
First, we investigated source code representation learning, aiming to improve performance in vulnerable code prediction. We found that a number of code signals remained unexploited. We therefore proposed WYSiWiM and CodeGrid, which leverage computer vision techniques to perform several software engineering tasks (including code clone detection and vulnerability prediction). While WYSiWiM is a purely visual representation of source code, CodeGrid represents code as grids.
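As a rough illustration of what representing code as a grid can mean, the sketch below maps a snippet to a rectangular grid of character codes, one row per source line. This is an assumption made for illustration only: the function name `code_to_grid` and the padding scheme are hypothetical and are not CodeGrid's actual encoding.

```python
def code_to_grid(source, pad=0):
    """Map source text to a rectangular grid of character codes.

    Hypothetical illustration only: one row per source line, one cell per
    character, with shorter lines right-padded with `pad`.
    """
    lines = source.splitlines()
    # The grid is as wide as the longest line (0 for empty input).
    width = max((len(line) for line in lines), default=0)
    return [
        [ord(ch) for ch in line] + [pad] * (width - len(line))
        for line in lines
    ]


if __name__ == "__main__":
    snippet = "if (n > 0)\n    free(p);"
    for row in code_to_grid(snippet):
        print(row)
```

A grid of this kind keeps the two-dimensional layout of the code (indentation, line breaks), which is the sort of signal a vision-style model can consume.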
Second, we targeted the automated repair of vulnerable code. Given the scarcity of test cases, we designed NERVE, a novel deep learning-based approach for automating vulnerable software repair. Instead of tests, NERVE leverages the signal in vulnerability prediction and the fix suggestions of static analysis security testing (SAST) to learn to repair vulnerable code. NERVE’s learning architecture relies on the pre-trained CodeT5 model for source code representation, augmented with a mixed learning objective. The first objective uses a triplet loss to build an embedding space that brings each vulnerable code sample closer to good fixes while keeping it away from incorrect fixes. The second objective incorporates cosine similarity into its loss function to align repair candidates with SAST fix suggestions. We assessed NERVE on over 3,000 real-world vulnerable code samples from C/C++, C#, Python, and Java programs. We show that NERVE advances the state of the art (SOTA) in neural vulnerability repair, outperforming existing approaches.
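The two objectives described above can be sketched on plain embedding vectors. This is a minimal sketch under stated assumptions, not NERVE's implementation: the Euclidean distance, the margin value, the weighting factor `alpha`, and all function names are hypothetical choices, and in practice the vectors would come from the CodeT5 encoder rather than being written by hand.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Pull the anchor (vulnerable code) toward the positive (good fix)
    and push it away from the negative (incorrect fix)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def cosine_alignment_loss(candidate, suggestion):
    """Zero when the repair candidate points in the same direction as the
    SAST fix suggestion; grows as the two diverge."""
    return 1.0 - cosine_similarity(candidate, suggestion)


def mixed_loss(anchor, positive, negative, candidate, suggestion,
               margin=1.0, alpha=0.5):
    """Weighted sum of the two objectives; `alpha` is an assumed weight."""
    return (alpha * triplet_loss(anchor, positive, negative, margin)
            + (1.0 - alpha) * cosine_alignment_loss(candidate, suggestion))


if __name__ == "__main__":
    vulnerable = [0.0, 0.0]
    good_fix = [0.1, 0.0]
    bad_fix = [3.0, 4.0]
    candidate = [1.0, 0.2]
    sast_hint = [1.0, 0.0]
    print(mixed_loss(vulnerable, good_fix, bad_fix, candidate, sast_hint))
```

The triplet term vanishes once the good fix is closer to the vulnerable code than the incorrect fix by at least the margin, so the remaining loss then comes only from misalignment with the SAST suggestion.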
Finally, to overcome the limited availability of vulnerability fix data, we propose VulFix, an approach for automatically mining open-source project repositories with the aim of building the largest vulnerability dataset composed of pairs of vulnerable code and its fix. We applied VulFix to 373 open-source projects to build and release VulFixDepot, a dataset for automatic vulnerability repair with more than 36k pairs in several programming languages.