Article (Scientific journals)
A benchmark of expert-level academic questions to assess AI capabilities.
Center for AI Safety; Scale AI; HLE Contributors Consortium et al.
2026, In Nature, 649 (8099), p. 1139-1146
 

Files


Full Text
s41586-025-09962-4.pdf
Author postprint (3.47 MB)



Details



Keywords :
Humans; Benchmarking/methods; Benchmarking/standards; Artificial Intelligence/standards; Language; Educational Measurement/methods; Educational Measurement/standards; Multidisciplinary; LLM; AI
Abstract :
[en] Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve more than 90% accuracy on popular benchmarks such as Measuring Massive Multitask Language Understanding [1], limiting informed measurement of state-of-the-art LLM capabilities. Here, in response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be an expert-level closed-ended academic benchmark with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable but cannot be quickly answered by internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a marked gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
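The abstract reports both accuracy and calibration for state-of-the-art LLMs on HLE. As a minimal sketch of how such metrics can be computed, assuming each automatically graded item carries a correctness flag and a self-reported model confidence, the Python below shows per-item accuracy and a binned RMS calibration error; the Result structure, field names and binning scheme are illustrative assumptions, not the paper's published evaluation code.

# Hypothetical sketch of accuracy and calibration scoring on HLE-style items.
from dataclasses import dataclass

@dataclass
class Result:
    correct: bool      # whether the graded answer matched the known solution
    confidence: float  # model's self-reported confidence in [0, 1]

def accuracy(results: list[Result]) -> float:
    # Fraction of items answered correctly.
    return sum(r.correct for r in results) / len(results)

def rms_calibration_error(results: list[Result], n_bins: int = 10) -> float:
    # RMS gap between stated confidence and empirical accuracy,
    # computed over equal-width confidence bins (assumed scheme).
    bins: list[list[Result]] = [[] for _ in range(n_bins)]
    for r in results:
        idx = min(int(r.confidence * n_bins), n_bins - 1)
        bins[idx].append(r)
    total, sq_err = len(results), 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(r.confidence for r in b) / len(b)
        acc = sum(r.correct for r in b) / len(b)
        sq_err += (len(b) / total) * (conf - acc) ** 2
    return sq_err ** 0.5

A well-calibrated but weak model would score low on accuracy yet low on calibration error; the abstract's finding is that frontier LLMs do poorly on both measures for HLE.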
Disciplines :
Computer science
Author, co-author :
Center for AI Safety
Scale AI
HLE Contributors Consortium
KUCHKIN, Vladyslav  ;  University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Physics and Materials Science (DPHYMS)
External co-authors :
no
Language :
English
Title :
A benchmark of expert-level academic questions to assess AI capabilities.
Publication date :
January 2026
Journal title :
Nature
ISSN :
0028-0836
eISSN :
1476-4687
Publisher :
Springer Science and Business Media LLC, England
Volume :
649
Issue :
8099
Pages :
1139 - 1146
Peer reviewed :
Peer Reviewed verified by ORBi
Available on ORBilu :
since 09 February 2026

Statistics

Number of views :
71 (1 by Unilu)
Number of downloads :
158 (0 by Unilu)
Scopus citations® :
2
Scopus citations® without self-citations :
2
OpenCitations :
0
OpenAlex citations :
1
