Article (Scientific journals)
TaskEval: Assessing Difficulty of Code Generation Tasks for Large Language Models
TAMBON, Florian; Nikanjam, Amin; Zid, Cyrine et al.
2025, in ACM Transactions on Software Engineering and Methodology

Files


Full Text: 2407.21227v3.pdf (author preprint, 1.48 MB)

Details



Keywords :
Computer Science - Software Engineering; Computer Science - Artificial Intelligence
Abstract :
[en] Large Language Models (LLMs) excel in code-related tasks such as code generation, but benchmark evaluations often overlook task characteristics such as difficulty. Moreover, benchmarks are usually built from tasks described with a single prompt, even though prompt formulation has a profound impact on the outcome. This paper introduces TaskEval, a generalist framework that uses diverse prompts and Item Response Theory (IRT) to efficiently assess LLMs' capabilities and benchmark task characteristics, improving the understanding of their performance. Using two code generation benchmarks, HumanEval+ and ClassEval, as well as 8 code generation LLMs, we show that TaskEval is capable of characterising the properties of tasks. Using topic analysis, we identify and analyse 17 and 21 topics of tasks within the two benchmarks, respectively. We also cross-analyse tasks' characteristics against the programming constructs (e.g., variable assignment, conditions) used by LLMs, highlighting patterns related to task difficulty. Finally, we compare the difficulty assessment of tasks by human annotators and by LLMs. Orthogonal to current benchmarking evaluation efforts, TaskEval can assist researchers and practitioners in building better assessments of LLMs. The tasks' characteristics can be used to identify shortcomings within existing benchmarks or to improve the evaluation of LLMs.
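Note: the record above does not include implementation details. Purely as an illustration of the Item Response Theory step mentioned in the abstract, the sketch below fits a standard two-parameter logistic (2PL) model to a toy LLM-vs-task pass/fail matrix and reads off per-task difficulty estimates. The 2PL form, the toy data, and all names (responses, neg_log_likelihood) are assumptions for illustration only, not the authors' code or parameterisation.

# Minimal 2PL IRT sketch (hypothetical, toy data): responses[j, i] = 1 if LLM j solved task i.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
responses = (rng.random((8, 20)) > 0.4).astype(float)  # 8 models x 20 tasks, made-up outcomes
n_models, n_tasks = responses.shape

def neg_log_likelihood(params):
    # Unpack per-model abilities (theta), per-task discriminations (a) and difficulties (b).
    theta = params[:n_models]
    a = params[n_models:n_models + n_tasks]
    b = params[n_models + n_tasks:]
    logits = a[None, :] * (theta[:, None] - b[None, :])
    p = 1.0 / (1.0 + np.exp(-logits))
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

x0 = np.concatenate([np.zeros(n_models), np.ones(n_tasks), np.zeros(n_tasks)])
fit = minimize(neg_log_likelihood, x0, method="L-BFGS-B")
difficulty = fit.x[n_models + n_tasks:]  # higher b means a harder task under this toy model
print(np.round(difficulty, 2))

In this toy setup, tasks that few (simulated) models solve receive higher difficulty estimates, which is the kind of per-task characterisation the abstract describes.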
Disciplines :
Computer science
Author, co-author :
TAMBON, Florian ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SerVal
Nikanjam, Amin ;  Huawei Distributed Scheduling and Data Engine Lab, Canada
Zid, Cyrine ;  Polytechnique Montreal, Canada
Khomh, Foutse ;  Polytechnique Montreal, Canada
Antoniol, Giuliano ;  Polytechnique Montreal, Canada
External co-authors :
yes
Language :
English
Title :
TaskEval: Assessing Difficulty of Code Generation Tasks for Large Language Models
Publication date :
2025
Journal title :
ACM Transactions on Software Engineering and Methodology
ISSN :
1049-331X
Publisher :
Association for Computing Machinery (ACM)
Peer reviewed :
Peer Reviewed verified by ORBi
Available on ORBilu :
since 08 January 2026
