Benchmarks Saturate When The Model Gets Smarter Than The Judge

January 27, 2026
Authors: Marthe Ballon, Andres Algaba, Brecht Verbeken, Vincent Ginis
cs.AI

Abstract

Benchmarks are important tools to track progress in the development of Large Language Models (LLMs), yet inaccuracies in datasets and evaluation methods consistently undermine their effectiveness. Here, we present Omni-MATH-2, a manually revised version of the Omni-MATH dataset comprising a clean, exact-answer subset (n = 4181) and a tagged, non-standard subset (n = 247). Each problem was audited to ensure LaTeX compilability, solvability, and verifiability, which involved adding missing figures or information, labeling problems requiring a proof, estimation, or image, and removing clutter. This process significantly reduces dataset-induced noise, thereby providing a more precise assessment of model performance. The annotated dataset also allows us to evaluate judge-induced noise by comparing GPT-5 mini with the original Omni-Judge, revealing substantial discrepancies between judges on both the clean and tagged problem subsets. Expert annotations reveal that Omni-Judge is wrong in 96.4% of the judge disagreements, indicating its inability to differentiate between models' abilities, even well before saturation of the benchmark occurs. As problems become more challenging, we find that increasingly competent judges become essential in order to prevent judge errors from masking genuine differences between models. Finally, neither judge identifies the present failure modes for the subset of tagged problems, demonstrating that dataset quality and judge reliability are both critical to develop accurate benchmarks of model performance.
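The judge-comparison analysis described in the abstract amounts to measuring how often two judges issue conflicting correctness verdicts on the same model answers, per subset. Below is a minimal sketch of that computation; the record fields (`subset`, `gpt5_mini_verdict`, `omni_judge_verdict`) are hypothetical names chosen for illustration and are not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class JudgedProblem:
    problem_id: str
    subset: str                # "clean" (exact-answer) or "tagged" (non-standard)
    gpt5_mini_verdict: bool    # GPT-5 mini's correctness verdict on the model answer
    omni_judge_verdict: bool   # Omni-Judge's correctness verdict on the same answer

def disagreement_rate(problems: list[JudgedProblem], subset: str) -> float:
    """Fraction of problems in a given subset where the two judges disagree."""
    rows = [p for p in problems if p.subset == subset]
    if not rows:
        return 0.0
    disagreements = sum(p.gpt5_mini_verdict != p.omni_judge_verdict for p in rows)
    return disagreements / len(rows)
```

Under the paper's setup, the disagreement cases would then be passed to expert annotators to determine which judge erred; the 96.4% figure refers to Omni-Judge's error rate within those disagreements.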