With the involvement of multiple programming languages in modern software
development, cross-lingual code clone detection has gained traction within the
software engineering community. Numerous studies have explored this topic,
proposing various promising approaches. Inspired by the significant advances in
machine learning in recent years, particularly Large Language Models (LLMs),
which have demonstrated their ability to tackle various tasks, this paper
revisits cross-lingual code clone detection. We evaluate the performance of
five LLMs and eight prompts for the identification of cross-lingual
code clones. Additionally, we compare these results against two baseline
methods. Finally, we evaluate a pre-trained embedding model to assess the
effectiveness of the generated representations for classifying clone and
non-clone pairs. Both the LLMs and the embedding model are evaluated on
two widely used cross-lingual datasets, XLCoST and CodeNet. Our results
show that LLMs can achieve high F1 scores, up to 0.99, for straightforward
programming examples. However, they perform worse on programs drawn from
complex programming challenges, and they do not necessarily understand
what "code clones" means in a cross-lingual setting. We show
that embedding models used to represent code fragments from different
programming languages in the same representation space enable the training of a
basic classifier that outperforms all LLMs by ~1 and ~20 percentage points on
the XLCoST and CodeNet datasets, respectively. This finding suggests that,
despite the apparent capabilities of LLMs, embedding models provide
representations well suited to achieving state-of-the-art performance
in cross-lingual code clone detection.
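
To make the embedding-based setup concrete, the following is a minimal sketch of classifying clone and non-clone pairs from embeddings and scoring the result with F1. The abstract does not say which classifier or embedding model is used, so a learned cut-off on cosine similarity stands in here as the simplest possible "basic classifier". The two-dimensional vectors are invented toy values, not output of any real model, and the names `cosine`, `f1_score`, `fit_threshold` and `predict` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def f1_score(y_true, y_pred):
    """F1 for the positive (clone) class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def fit_threshold(pairs, labels):
    """Pick the similarity cut-off that maximises F1 on the training pairs."""
    sims = [cosine(u, v) for u, v in pairs]
    best_t, best_f1 = 0.0, -1.0
    for t in sorted(set(sims)):
        preds = [int(s >= t) for s in sims]
        score = f1_score(labels, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t


def predict(pairs, threshold):
    """Label a pair as a clone (1) when its similarity reaches the threshold."""
    return [int(cosine(u, v) >= threshold) for u, v in pairs]


# Toy training data (assumption): each pair holds the embeddings of two code
# fragments written in different languages; label 1 marks a clone pair.
train_pairs = [
    ([1.0, 0.0], [0.9, 0.1]),  # clone
    ([0.0, 1.0], [0.1, 0.9]),  # clone
    ([1.0, 0.0], [0.0, 1.0]),  # non-clone
    ([1.0, 0.0], [0.6, 0.8]),  # non-clone
]
train_labels = [1, 1, 0, 0]

threshold = fit_threshold(train_pairs, train_labels)
train_f1 = f1_score(train_labels, predict(train_pairs, threshold))
print(f"threshold={threshold:.3f}  train F1={train_f1:.2f}")
```

In a realistic setting the vectors would come from a pre-trained embedding model applied to each code fragment, and the cut-off would be replaced by a classifier trained on pair features and evaluated on held-out data. The sketch only illustrates why a shared representation space makes the classification step simple.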
Research center :
Interdisciplinary Centre for Security, Reliability and Trust (SnT) > Other
Disciplines :
Computer science
Author, co-author :
MOUMOULA, Micheline Benedicte ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
KABORE, Abdoul Kader ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SNT Office > Project Coordination