

HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam

February 15, 2026
Authors: Weiqi Zhai, Zhihai Wang, Jinghang Wang, Boyu Yang, Xiaogang Li, Xiang Xu, Bohan Wang, Peng Wang, Xingzhe Wu, Anfeng Li, Qiyuan Feng, Yuhao Zhou, Shoulin Han, Wenjie Luo, Yiyuan Li, Yaxuan Wang, Ruixian Luo, Guojie Lin, Peiyao Xiao, Chengliang Xu, Ben Wang, Zeyu Wang, Zichao Chen, Jianan Ye, Yijie Hu, Jialong Chen, Zongwen Shen, Yuliang Xu, An Yang, Bowen Yu, Dayiheng Liu, Junyang Lin, Hu Wei, Que Shen, Bing Zhao
cs.AI

Abstract

Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions. However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distort cross-model comparisons. To address this challenge, we introduce HLE-Verified, a verified and revised version of HLE with a transparent verification protocol and fine-grained error taxonomy. Our construction follows a two-stage validation-and-repair workflow resulting in a certified benchmark. In Stage I, each item undergoes binary validation of the problem and final answer through domain-expert review and model-based cross-checks, yielding 641 verified items. In Stage II, flawed but fixable items are revised under strict constraints preserving the original evaluation intent, through dual independent expert repairs, model-assisted auditing, and final adjudication, resulting in 1,170 revised-and-certified items. The remaining 689 items are released as a documented uncertain set with explicit uncertainty sources and expertise tags for future refinement. We evaluate seven state-of-the-art language models on HLE and HLE-Verified, observing an average absolute accuracy gain of 7–10 percentage points on HLE-Verified. The improvement is particularly pronounced on items where the original problem statement and/or reference answer is erroneous, with gains of 30–40 percentage points. Our analyses further reveal a strong association between model confidence and the presence of errors in the problem statement or reference answer, supporting the effectiveness of our revisions. Overall, HLE-Verified improves HLE-style evaluations by reducing annotation noise and enabling more faithful measurement of model capabilities. Data is available at: https://github.com/SKYLENAGE-AI/HLE-Verified
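The two-stage partition described in the abstract can be checked arithmetically: the verified, revised, and uncertain subsets are disjoint and should sum to the size of the original benchmark. The sketch below is illustrative only; the `Partition` dataclass and its field names are hypothetical conveniences, not the authors' tooling, though the counts (641, 1,170, 689) are taken directly from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """Hypothetical summary of the HLE-Verified item split (counts from the abstract)."""
    verified: int   # Stage I: items passing binary validation as-is
    revised: int    # Stage II: flawed items repaired and certified
    uncertain: int  # documented uncertain set, released for future refinement

    @property
    def certified(self) -> int:
        # Items usable for evaluation: verified plus revised-and-certified.
        return self.verified + self.revised

    @property
    def total(self) -> int:
        # Disjoint subsets, so the total is a simple sum.
        return self.verified + self.revised + self.uncertain


hle_verified = Partition(verified=641, revised=1170, uncertain=689)
print(hle_verified.certified)  # 1811 certified items
print(hle_verified.total)      # 2500, consistent with the original HLE release
```

Note that the three subsets sum to 2,500, the published size of the original HLE question set, which is a quick consistency check on the reported counts.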
PDF: February 19, 2026