Poster (Scientific congresses, symposiums and conference proceedings)
Cross-lingual Code Clone Detection: When LLMs Fail Short Against Embedding-based Classifier
MOUMOULA, Micheline Benedicte; KABORE, Abdoul Kader; KLEIN, Jacques et al.
2024, Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering
Peer reviewed
 

Files


Full Text
clccd.pdf
Author postprint (427.08 kB)

Details



Keywords :
code clone detection; cross-language pairs; embedding model; large language model; prompt engineering; cross languages; cross-lingual; embeddings; language model; language pairs; Artificial Intelligence; Software; Safety, Risk, Reliability and Quality
Abstract :
Cross-lingual code clone detection has gained attention in software development due to the use of multiple programming languages. Recent advances in machine learning, particularly Large Language Models (LLMs), have motivated a reexamination of this problem. This paper evaluates the performance of four LLMs and eight prompts for detecting cross-lingual code clones, as well as a pretrained embedding model for classifying clone pairs. Both approaches are tested on the XLCoST and CodeNet datasets. Our findings show that while LLMs achieve high F1 scores (up to 0.98) on straightforward programming examples, they struggle with complex cases and cross-lingual understanding. In contrast, embedding models, which map code fragments from different languages into a common representation space, allow for the training of a basic classifier that outperforms LLMs by approximately 2 and 24 percentage points on the XLCoST and CodeNet datasets, respectively. This suggests that embedding models provide more robust representations, enabling state-of-the-art performance in cross-lingual code clone detection.
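The embedding-based approach described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the embedding model, the pair features, or the classifier, so everything below is an assumption made for illustration. The embeddings are synthetic random vectors standing in for the output of a pretrained model, the pair features (cosine similarity and mean absolute difference) and the logistic-regression classifier are hypothetical choices, and the helper names are invented for this example.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_features(u, v):
    """Hypothetical pair features: cosine similarity and mean absolute difference."""
    mean_abs_diff = sum(abs(a - b) for a, b in zip(u, v)) / len(u)
    return [cosine(u, v), mean_abs_diff]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(X, y, lr=0.5, epochs=300):
    """Train a logistic-regression classifier with plain stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - t
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def predict(w, b, x):
    """Return 1 (clone) when the linear score is non-negative, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0


def f1_score(y_true, y_pred):
    """F1 score of the positive (clone) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Synthetic stand-in for embeddings of code fragments in two languages.
# Assumption: clones land close together in the shared space, non-clones do not.
rng = random.Random(0)
DIM = 32


def synthetic_pair(is_clone):
    u = [rng.gauss(0, 1) for _ in range(DIM)]
    if is_clone:
        v = [a + rng.gauss(0, 0.1) for a in u]  # small perturbation of u
    else:
        v = [rng.gauss(0, 1) for _ in range(DIM)]  # unrelated vector
    return u, v


labels = [i % 2 for i in range(60)]  # alternate non-clone / clone
features = [pair_features(*synthetic_pair(lab == 1)) for lab in labels]

X_train, y_train = features[:40], labels[:40]
X_test, y_test = features[40:], labels[40:]

w, b = train_logreg(X_train, y_train)
y_pred = [predict(w, b, x) for x in X_test]
test_f1 = f1_score(y_test, y_pred)
print(f"F1 on held-out synthetic pairs: {test_f1:.2f}")
```

With real data, the synthetic vectors would be replaced by embeddings of the two code fragments, and the F1 score would be computed on a labeled benchmark such as the XLCoST or CodeNet pairs mentioned in the abstract. The score printed here reflects only the toy data and says nothing about the paper's results.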
Research center :
Interdisciplinary Centre for Security, Reliability and Trust (SnT) > TruX - Trustworthy Software Engineering
Disciplines :
Computer science
Author, co-author :
MOUMOULA, Micheline Benedicte ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
KABORE, Abdoul Kader ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SNT Office > Project Coordination
KLEIN, Jacques ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
Bissyande, Tegawende F. ; University of Luxembourg, Luxembourg
External co-authors :
no
Language :
English
Title :
Cross-lingual Code Clone Detection: When LLMs Fail Short Against Embedding-based Classifier
Publication date :
27 October 2024
Event name :
Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering
Event place :
Sacramento, USA
Event date :
28 October 2024 to 1 November 2024
By request :
Yes
Peer reviewed :
Peer reviewed
European Projects :
H2020 - 949014 - NATURAL - Natural Program Repair
Name of the research project :
R-AGR-3790 - LuxWays - part UL - BISSYANDE Tegawendé
Funders :
Ministry of Foreign Affairs, European Union and Cooperation
European Union
Funding number :
R-AGR-3790; H2020 - 949014
Available on ORBilu :
since 21 January 2026

Statistics


Number of views
10 (1 by Unilu)
Number of downloads
2 (0 by Unilu)

Scopus citations®
3
Scopus citations® without self-citations
3
OpenCitations
0
OpenAlex citations
0
