Reference : Assessing the Generalizability of code2vec Token Embeddings
Scientific congresses, symposiums and conference proceedings : Paper published in a book
Engineering, computing & technology : Computer science
Security, Reliability and Trust
http://hdl.handle.net/10993/41962
Assessing the Generalizability of code2vec Token Embeddings
English
Kang, Hong Jin [Singapore Management University > SIS]
Bissyande, Tegawendé François D'Assise
Lo, David
Nov-2019
Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering
1-12
Yes
No
International
34th IEEE/ACM International Conference on Automated Software Engineering
from 10/11/2019 to 15/11/2019
San Diego, California
United States
[en] Code Embeddings ; Distributed Representations ; Big Code
[en] Many Natural Language Processing (NLP) tasks, such as sentiment analysis or syntactic parsing, have benefited from the development of word embedding models. In particular, regardless of the training algorithm, the learned embeddings have often been shown to generalize across different NLP tasks. In contrast, despite recent momentum on word embeddings for source code, the literature lacks evidence of their generalizability beyond the example task they were trained for. In this experience paper, we identify three potential downstream tasks, namely code comment generation, code authorship identification, and code clone detection, to which source code token embedding models can be applied. We empirically assess a recently proposed code token embedding model, namely code2vec's token embeddings. Code2vec was trained on the task of predicting method names, and while the vectors it learns could potentially be used for other tasks, this has not been explored in the literature. We therefore fill this gap by focusing on its generalizability to the tasks we have identified. Ultimately, we show that source code token embeddings cannot be readily leveraged for these downstream tasks. Our experiments even show that our attempts to use them do not result in any improvement over less sophisticated methods. We call for more research into effective and general use of code embeddings.
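To make the idea of reusing code2vec token embeddings for a downstream task more concrete, the sketch below shows one minimal way this could be attempted: load pre-trained token vectors, average them into a per-snippet feature vector, and feed that to a simple classifier (illustrated here with a toy authorship-identification setup). This is an illustrative assumption-laden sketch, not the paper's exact pipeline; the file name `token_vecs.txt`, the regex tokenizer, the average pooling, and the toy data are all assumptions made for illustration.

```python
# Illustrative sketch: reusing pre-trained code2vec token embeddings as
# features for a downstream classifier (e.g., code authorship identification).
# The file name, tokenizer, pooling scheme, and toy data are assumptions,
# not the paper's actual experimental setup.
import re
import numpy as np
from gensim.models import KeyedVectors
from sklearn.linear_model import LogisticRegression

# Load token embeddings released in word2vec text format (assumed file name).
token_vecs = KeyedVectors.load_word2vec_format("token_vecs.txt", binary=False)

def embed_snippet(code: str) -> np.ndarray:
    """Average the embeddings of known identifiers; zeros if none are known."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z_0-9]*", code)
    vecs = [token_vecs[t] for t in tokens if t in token_vecs]
    if not vecs:
        return np.zeros(token_vecs.vector_size)
    return np.mean(vecs, axis=0)

# Hypothetical downstream task: map each snippet to its author label.
snippets = [
    "int add(int a, int b) { return a + b; }",
    "public String name() { return this.name; }",
]
labels = ["author_a", "author_b"]

X = np.vstack([embed_snippet(s) for s in snippets])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X))
```

Average pooling is only one of many possible aggregation choices; the paper's finding is that such straightforward reuse of the token embeddings does not outperform less sophisticated baselines on the tasks studied.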