

Benchmarks Saturate When The Model Gets Smarter Than The Judge

January 27, 2026
Authors: Marthe Ballon, Andres Algaba, Brecht Verbeken, Vincent Ginis
cs.AI

Abstract

Benchmarks are important tools to track progress in the development of Large Language Models (LLMs), yet inaccuracies in datasets and evaluation methods consistently undermine their effectiveness. Here, we present Omni-MATH-2, a manually revised version of the Omni-MATH dataset comprising a clean, exact-answer subset (n = 4181) and a tagged, non-standard subset (n = 247). Each problem was audited to ensure LaTeX compilability, solvability, and verifiability, which involved adding missing figures or information, labeling problems requiring a proof, estimation, or image, and removing clutter. This process significantly reduces dataset-induced noise, thereby providing a more precise assessment of model performance. The annotated dataset also allows us to evaluate judge-induced noise by comparing GPT-5 mini with the original Omni-Judge, revealing substantial discrepancies between judges on both the clean and tagged problem subsets. Expert annotations reveal that Omni-Judge is wrong in 96.4% of the judge disagreements, indicating its inability to differentiate between models' abilities, even well before saturation of the benchmark occurs. As problems become more challenging, we find that increasingly competent judges become essential in order to prevent judge errors from masking genuine differences between models. Finally, neither judge identifies the present failure modes for the subset of tagged problems, demonstrating that dataset quality and judge reliability are both critical to develop accurate benchmarks of model performance.
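
As a rough illustration of the judge-comparison idea described in the abstract, the sketch below computes the disagreement rate between two automatic judges and each model's accuracy under either judge. This is a minimal, hypothetical example: the record format and field names (problem_id, model, verdict_a, verdict_b) are assumptions for illustration, not the paper's actual evaluation code or the Omni-MATH-2 data schema.

```python
# Hypothetical sketch: quantifying judge-induced noise by comparing the
# verdicts of two automatic judges on the same set of model answers.
from collections import defaultdict

def judge_disagreement(records):
    """Return (overall disagreement rate, per-model accuracy under each judge).

    `records` is an iterable of dicts with keys:
      problem_id, model, verdict_a (bool), verdict_b (bool)
    where verdict_a / verdict_b are the correct/incorrect calls of two judges.
    """
    disagreements = 0
    totals = defaultdict(int)
    correct_a = defaultdict(int)
    correct_b = defaultdict(int)

    for r in records:
        totals[r["model"]] += 1
        correct_a[r["model"]] += r["verdict_a"]
        correct_b[r["model"]] += r["verdict_b"]
        disagreements += r["verdict_a"] != r["verdict_b"]

    n = sum(totals.values())
    per_model = {
        m: (correct_a[m] / totals[m], correct_b[m] / totals[m]) for m in totals
    }
    return (disagreements / n if n else 0.0), per_model
```

Under this framing, a large gap between the two accuracy columns for the same model is judge-induced noise rather than a difference in model ability, which is the effect the paper measures when contrasting GPT-5 mini with the original Omni-Judge.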