Smart contract comment generation has gained traction as a means to improve
code comprehension and maintainability in blockchain systems. However,
evaluating the quality of generated comments remains a challenge. Traditional
metrics such as BLEU and ROUGE fail to capture domain-specific nuances, while
human evaluation is costly and unscalable. In this paper, we present
\texttt{evalSmarT}, a modular and extensible framework that leverages large
language models (LLMs) as evaluators. The system supports over 400 evaluator
configurations by combining approximately 40 LLMs with 10 prompting strategies.
We demonstrate its application in benchmarking comment generation tools and
selecting the most informative outputs. Our results show that prompt design
significantly impacts alignment with human judgment, and that LLM-based
evaluation offers a scalable and semantically rich alternative to existing
methods.
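The abstract only sketches the configuration space at a high level. As a rough illustration of how roughly 40 LLMs crossed with 10 prompting strategies yield 400+ evaluator configurations, the following Python sketch uses entirely hypothetical names (EvaluatorConfig, score_comment, and the short model/strategy lists); it is not taken from the evalSmarT codebase.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class EvaluatorConfig:
    """One evaluator = one LLM paired with one prompting strategy (hypothetical type)."""
    model: str
    prompt_strategy: str

# Illustrative, non-exhaustive lists; the paper reports ~40 models and 10 strategies.
MODELS = ["gpt-4o", "llama-3-70b", "mistral-7b"]
STRATEGIES = ["zero-shot", "few-shot", "chain-of-thought"]

# Crossing the two lists spans the evaluator configuration space
# (roughly 40 x 10 > 400 in the full framework).
CONFIGS = [EvaluatorConfig(m, s) for m, s in product(MODELS, STRATEGIES)]

def score_comment(config: EvaluatorConfig, code: str, comment: str) -> float:
    """Placeholder: in practice this would prompt config.model with a
    strategy-specific template and parse a numeric quality score."""
    raise NotImplementedError

if __name__ == "__main__":
    print(f"{len(CONFIGS)} evaluator configurations generated")
```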
Disciplines: Computer science
Author, co-author: MBODJI, Fatou Ndiaye; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX