On the impact of tokenizer and parameters on N-gram based Code Analysis

JIMENEZ, Matthieu; CORDY, Maxime; LE TRAON, Yves; PAPADAKIS, Mike

Unpublished conference/Abstract (Scientific congresses, symposiums and conference proceedings)

2018 • 34th IEEE International Conference on Software Maintenance and Evolution (ICSME'18)

Permalink
https://hdl.handle.net/10993/36135

Files

Full Text

icsme3.pdf

Author preprint (544.23 kB)

All documents in ORBilu are protected by a user license.

Details

Keywords :

N-Grams Model; Source Code Analysis; Naturalness

Abstract :

[en] Recent research shows that language models, such as n-gram models, are useful for a wide variety of software engineering tasks, e.g., code completion, bug identification, and code summarisation. However, such models require many parameters to be set appropriately. Moreover, the different ways one can read code essentially yield different models (based on the different sequences of tokens). In this paper, we focus on n-gram models and evaluate how the choice of tokenizer, smoothing, unknown threshold and n value affects the predictive ability of these models. To this end, we compare multiple tokenizers and sets of parameters (smoothing, unknown threshold and n values) with the aim of identifying the most appropriate combinations. Our results show that the Modified Kneser-Ney smoothing technique performs best, while the best n values depend on the choice of tokenizer, with values of 4 or 5 offering a good trade-off between entropy and computation time. Interestingly, we find that tokenizers treating the code as simple text are the most robust ones. Finally, we demonstrate that the differences between the tokenizers are of practical importance and have the potential to change the conclusions of a given experiment.
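
To make the parameters named in the abstract concrete, the following is a minimal Python sketch of how a tokenizer, an unknown threshold, and the n value each enter an n-gram cross-entropy computation. It is an illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the two tokenizers are simplified stand-ins, and the smoothing is plain add-k rather than the Modified Kneser-Ney technique the paper reports as best.

```python
import math
import re
from collections import Counter


def text_tokenizer(code):
    """Treat code as plain text: split on whitespace only (assumed stand-in)."""
    return code.split()


def lexer_like_tokenizer(code):
    """Rough lexer-style split into identifiers, numbers and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def build_vocab(train_tokens, threshold):
    """Unknown threshold: keep only tokens seen at least `threshold` times."""
    counts = Counter(train_tokens)
    return {tok for tok, c in counts.items() if c >= threshold}


def map_unknown(tokens, vocab):
    """Replace out-of-vocabulary tokens with a shared <unk> symbol."""
    return [tok if tok in vocab else "<unk>" for tok in tokens]


def ngrams(tokens, n):
    """Pad with start/end markers and return all n-grams as tuples."""
    padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


def train(tokens, n):
    """Count n-grams and their (n-1)-token contexts."""
    grams = ngrams(tokens, n)
    return Counter(grams), Counter(g[:-1] for g in grams)


def cross_entropy(test_tokens, model, n, vocab_size, k=1.0):
    """Average bits per token under add-k smoothing (NOT Modified Kneser-Ney)."""
    gram_counts, context_counts = model
    grams = ngrams(test_tokens, n)
    total = 0.0
    for g in grams:
        p = (gram_counts[g] + k) / (context_counts[g[:-1]] + k * vocab_size)
        total -= math.log2(p)
    return total / len(grams)


def evaluate(tokenizer, train_code, test_code, n, threshold, k=1.0):
    """Train on one snippet and score another for a given configuration."""
    train_tokens = tokenizer(train_code)
    vocab = build_vocab(train_tokens, threshold)
    model = train(map_unknown(train_tokens, vocab), n)
    test_tokens = map_unknown(tokenizer(test_code), vocab)
    # +2 accounts for the <unk> and </s> symbols the model can also predict.
    return cross_entropy(test_tokens, model, n, len(vocab) + 2, k)


if __name__ == "__main__":
    train_code = "int a = b + 1 ; int c = a + 1 ; return c ;"
    test_code = "int d = a + 1 ; return d ;"
    for name, tok in [("text", text_tokenizer), ("lexer", lexer_like_tokenizer)]:
        for n in (1, 2, 3):
            h = evaluate(tok, train_code, test_code, n=n, threshold=1)
            print(f"{name:5s} n={n} entropy={h:.3f} bits/token")
```

Running the script prints one entropy value per tokenizer and n value. On toy input like this the numbers carry no empirical weight; the point is only that the same source text yields different token sequences, and therefore different models and scores, depending on the tokenizer and parameters chosen.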

Research center :

Interdisciplinary Centre for Security, Reliability and Trust (SnT) > Security Design and Validation Research Group (SerVal)

Disciplines :

Computer science

Author, co-author :

JIMENEZ, Matthieu ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT)

CORDY, Maxime ; University of Namur

LE TRAON, Yves ; University of Luxembourg > Faculty of Science, Technology and Communication (FSTC) > Computer Science and Communications Research Unit (CSC)

PAPADAKIS, Mike ; University of Luxembourg > Faculty of Science, Technology and Communication (FSTC) > Computer Science and Communications Research Unit (CSC)

External co-authors :

yes

Language :

English

Title :

On the impact of tokenizer and parameters on N-gram based Code Analysis

Publication date :

September 2018

Number of pages :

Event name :

34th IEEE International Conference on Software Maintenance and Evolution (ICSME'18)

Event place :

Madrid, Spain

Event date :

from 26-09-2018 to 28-09-2018

Audience :

International

Focus Area :

Computational Sciences

Available on ORBilu :

since 12 July 2018

Statistics

Number of views

195 (17 by Unilu)

Number of downloads

639 (22 by Unilu)
