What You See is What it Means! Semantic Representation Learning of Code based on Visualization

KELLER, Patrick; KABORE, Abdoul Kader; Plein, Laura; KLEIN, Jacques; LE TRAON, Yves; BISSYANDE, Tegawendé François D Assise

doi:10.1145/3485135

Article (Scientific journals)

2021 • In ACM Transactions on Software Engineering and Methodology

Peer Reviewed verified by ORBi

Permalink
https://hdl.handle.net/10993/48899

Files

Full Text

WYSiWiM_New_Version_TOSEM_Final.pdf

Author preprint (2.88 MB)

All documents in ORBilu are protected by a user license.

Details

Keywords :

Code Embedding; Visual Representation; Representation Learning; Vulnerability Detection; Code Clone; Code Classification

Abstract :

[en] Recent successes in training word embeddings for NLP tasks have encouraged a wave of research on representation learning for source code, which builds on similar NLP methods. The overall objective is then to produce code embeddings that capture the maximum of program semantics. State-of-the-art approaches invariably rely on a syntactic representation (i.e., raw lexical tokens, abstract syntax trees, or intermediate representation tokens) to generate embeddings, which are criticized in the literature as non-robust or non-generalizable. In this work, we investigate a novel embedding approach based on the intuition that source code has visual patterns of semantics. We further use these patterns to address the outstanding challenge of identifying semantic code clones. We propose the WySiWiM (“What You See Is What It Means”) approach where visual representations of source code are fed into powerful pre-trained image classification neural networks from the field of computer vision to benefit from the practical advantages of transfer learning. We evaluate the proposed embedding approach on the task of vulnerable code prediction in source code and on two variations of the task of semantic code clone identification: code clone detection (a binary classification problem), and code classification (a multi-classification problem). We show with experiments on the BigCloneBench (Java) and Open Judge (C) datasets that, although simple, our WySiWiM approach performs as effectively as state-of-the-art approaches such as ASTNN or TBCNN. We also show with data from NVD and SARD that the WySiWiM representation can be used to learn a vulnerable code detector with reasonable performance (accuracy ∼90%). We further explore the influence of different steps in our approach, such as the choice of visual representations or the classification algorithm, to eventually discuss the promises and limitations of this research direction.
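
The core idea in the abstract, turning source code into an image and deriving an embedding from that image, can be illustrated with a toy sketch. This is not the authors' implementation: the grid size, the character-to-intensity mapping, and the function names `render_code`, `embed`, and `cosine` are all assumptions made for illustration, and flattening the pixel grid is only a stand-in for the pre-trained image classification network the abstract mentions.

```python
import math

# Hypothetical image size: one "pixel" per character cell.
WIDTH, HEIGHT = 32, 16


def render_code(source, width=WIDTH, height=HEIGHT):
    """Render source text as a height x width grayscale grid.

    Blank cells are 0.0; any other character maps to a value in (0, 1]
    derived from its code point. Text beyond the grid is cropped.
    This mapping is an illustrative assumption, not the paper's renderer.
    """
    grid = [[0.0] * width for _ in range(height)]
    lines = source.expandtabs(4).splitlines()[:height]
    for row, line in enumerate(lines):
        for col, ch in enumerate(line[:width]):
            if not ch.isspace():
                grid[row][col] = min(ord(ch), 127) / 127.0
    return grid


def embed(grid):
    """Toy stand-in for a pre-trained CNN: flatten the image into a vector."""
    return [pixel for row in grid for pixel in row]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Two loops that differ only in identifiers, and one unrelated statement.
loop_a = "for (int i = 0; i < n; i++) {\n    sum += a[i];\n}"
loop_b = "for (int j = 0; j < m; j++) {\n    total += b[j];\n}"
other = "return x;"

vec_a = embed(render_code(loop_a))
vec_b = embed(render_code(loop_b))
vec_o = embed(render_code(other))

print("loop_a vs loop_b:", round(cosine(vec_a, vec_b), 3))
print("loop_a vs other: ", round(cosine(vec_a, vec_o), 3))
```

In this sketch the two loops share most of their layout and therefore score higher than the unrelated snippet. A realistic pipeline would replace `embed` with features taken from a pre-trained vision model and train a classifier on top of them.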

Research center :

Interdisciplinary Centre for Security, Reliability and Trust (SnT) > Trustworthy Software Engineering (TruX)

Disciplines :

Computer science

Author, co-author :

KELLER, Patrick ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)

KABORE, Abdoul Kader ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT) > TruX

Plein, Laura; Saarland University

KLEIN, Jacques ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT) > TruX

LE TRAON, Yves ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SerVal

BISSYANDE, Tegawendé François D Assise ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT) > TruX

External co-authors :

yes

Language :

English

Title :

What You See is What it Means! Semantic Representation Learning of Code based on Visualization

Publication date :

2021

Journal title :

ACM Transactions on Software Engineering and Methodology

ISSN :

1049-331X

Publisher :

Association for Computing Machinery (ACM), United States

Special issue title :

Continuous Special Section: AI and SE

Peer reviewed :

Peer Reviewed verified by ORBi

Focus Area :

Security, Reliability and Trust

FnR Project :

FNR14591304 - Neural Vulnerable Program Repair, 2020 (01/10/2020-30/09/2024) - Abdoul Kader Kaboré

Funders :

FNR - Fonds National de la Recherche
Gouvernement du Luxembourg under the LuxWays Project
CER - Conseil Européen de la Recherche

Available on ORBilu :

since 09 December 2021

Statistics

Number of views

359 (22 by Unilu)

Number of downloads

173 (6 by Unilu)
