Unpublished conference/Abstract (Scientific congresses, symposiums and conference proceedings)
On the impact of tokenizer and parameters on N-gram based Code Analysis
Jimenez, Matthieu; Cordy, Maxime; Le Traon, Yves et al.
2018, 34th IEEE International Conference on Software Maintenance and Evolution (ICSME'18)
 

Files


Full Text
icsme3.pdf
Author preprint (544.23 kB)

Details



Keywords :
N-Grams Model; Source Code Analysis; Naturalness
Abstract :
[en] Recent research shows that language models, such as n-gram models, are useful for a wide variety of software engineering tasks, e.g., code completion, bug identification and code summarisation. However, such models require numerous parameters to be set appropriately. Moreover, the different ways one can read code essentially yield different models (based on the different sequences of tokens). In this paper, we focus on n-gram models and evaluate how the choice of tokenizer, smoothing technique, unknown threshold and n value impacts the predictive ability of these models. To this end, we compare multiple tokenizers and sets of parameters (smoothing, unknown threshold and n values) with the aim of identifying the most appropriate combinations. Our results show that the Modified Kneser-Ney smoothing technique performs best, while n values depend on the choice of the tokenizer, with values of 4 or 5 offering a good trade-off between entropy and computation time. Interestingly, we find that tokenizers treating the code as simple text are the most robust ones. Finally, we demonstrate that the differences between the tokenizers are of practical importance and have the potential to change the conclusions of a given experiment.
Research center :
Interdisciplinary Centre for Security, Reliability and Trust (SnT) > Security Design and Validation Research Group (SerVal)
Disciplines :
Computer science
Author, co-author :
Jimenez, Matthieu; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT)
Cordy, Maxime; University of Namur
Le Traon, Yves; University of Luxembourg > Faculty of Science, Technology and Communication (FSTC) > Computer Science and Communications Research Unit (CSC)
Papadakis, Mike; University of Luxembourg > Faculty of Science, Technology and Communication (FSTC) > Computer Science and Communications Research Unit (CSC)
External co-authors :
yes
Language :
English
Title :
On the impact of tokenizer and parameters on N-gram based Code Analysis
Publication date :
September 2018
Number of pages :
10
Event name :
34th IEEE International Conference on Software Maintenance and Evolution (ICSME'18)
Event place :
Madrid, Spain
Event date :
from 26-09-2018 to 28-09-2018
Audience :
International
Focus Area :
Computational Sciences
Available on ORBilu :
since 12 July 2018
