On the impact of tokenizer and parameters on N-gram based Code Analysis
Scientific congresses, symposiums and conference proceedings : Unpublished conference
Engineering, computing & technology : Computer science
Computational Sciences
http://hdl.handle.net/10993/36135
English
Jimenez, Matthieu [University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT)]
Cordy, Maxime [University of Namur]
Le Traon, Yves [University of Luxembourg > Faculty of Science, Technology and Communication (FSTC) > Computer Science and Communications Research Unit (CSC)]
Papadakis, Mike [University of Luxembourg > Faculty of Science, Technology and Communication (FSTC) > Computer Science and Communications Research Unit (CSC)]
September 2018
10
Yes
No
International
34th IEEE International Conference on Software Maintenance and Evolution (ICSME'18)
26–28 September 2018
Madrid, Spain
Keywords: N-Grams Model; Source Code Analysis; Naturalness
Abstract: Recent research shows that language models, such as n-gram models, are useful for a wide variety of software engineering tasks, e.g., code completion, bug identification, and code summarisation. However, such models require setting numerous parameters appropriately. Moreover, the different ways one can read code essentially yield different models (based on the different sequences of tokens). In this paper, we focus on n-gram models and evaluate how the choice of tokenizer, smoothing technique, unknown threshold and n value impacts the predictive ability of these models. Thus, we compare multiple tokenizers and sets of different parameters (smoothing, unknown threshold and n values) with the aim of identifying the most appropriate combinations. Our results show that the Modified Kneser-Ney smoothing technique performs best, while the best n values depend on the choice of tokenizer, with values of 4 or 5 offering a good trade-off between entropy and computation time. Interestingly, we find that tokenizers treating the code as simple text are the most robust ones. Finally, we demonstrate that the differences between the tokenizers are of practical importance and can change the conclusions of a given experiment.
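To illustrate the kind of measurement the abstract describes, the sketch below trains a toy n-gram model on whitespace-tokenized code (one of the "code as simple text" tokenizer families the paper compares) and computes per-token cross-entropy on a held-out snippet. For brevity it uses add-one (Laplace) smoothing rather than the Modified Kneser-Ney smoothing the paper found to perform best; the function and variable names are illustrative, not taken from the paper's artifacts.

```python
import math
from collections import Counter

def tokenize(code):
    # Simplest "code as text" tokenizer: split on whitespace.
    return code.split()

def ngram_counts(tokens, n):
    # Counts of full n-grams and of their (n-1)-token contexts.
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams), Counter(g[:-1] for g in grams)

def cross_entropy(train_tokens, test_tokens, n, vocab_size):
    # Average negative log2 probability per test n-gram,
    # with add-one smoothing standing in for Modified Kneser-Ney.
    grams, contexts = ngram_counts(train_tokens, n)
    test_grams = [tuple(test_tokens[i:i + n])
                  for i in range(len(test_tokens) - n + 1)]
    total = 0.0
    for g in test_grams:
        p = (grams[g] + 1) / (contexts[g[:-1]] + vocab_size)
        total += -math.log2(p)
    return total / len(test_grams)

train = tokenize("if ( x > 0 ) { return x ; } else { return 0 ; }")
test = tokenize("if ( y > 0 ) { return y ; }")
vocab = set(train) | set(test)
print(cross_entropy(train, test, 3, len(vocab)))
```

Lower entropy means the model finds the test code more predictable; the paper's comparison varies the tokenizer, smoothing, unknown threshold and n, and measures this quantity along with computation time.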
Interdisciplinary Centre for Security, Reliability and Trust (SnT) > Security Design and Validation Research Group (SerVal)

File(s) associated to this reference

Fulltext file: icsme3.pdf (Author preprint, 531.47 kB, Open access)

