QEDBENCH: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs

February 24, 2026
Authors: Santiago Gonzalez, Alireza Amiri Bavandpour, Peter Ye, Edward Zhang, Ruslans Aleksejevs, Todor Antić, Polina Baron, Sujeet Bhalerao, Shubhrajit Bhattacharya, Zachary Burton, John Byrne, Hyungjun Choi, Nujhat Ahmed Disha, Koppany István Encz, Yuchen Fang, Robert Joseph George, Ebrahim Ghorbani, Alan Goldfarb, Jing Guo, Meghal Gupta, Stefano Huber, Annika Kanckos, Minjung Kang, Hyun Jong Kim, Dino Lorenzini, Levi Lorenzo, Tianyi Mao, Giovanni Marzenta, Ariane M. Masuda, Lukas Mauth, Ana Mickovic, Andres Miniguano-Trujillo, Antoine Moulin, Wenqi Ni, Tomos Parry, Kevin Ren, Hossein Roodbarani, Mathieu Rundström, Manjil Saikia, Detchat Samart, Rebecca Steiner, Connor Stewart, Dhara Thakkar, Jeffrey Tse, Vasiliki Velona, Yunhai Xiang, Sibel Yalçın, Jun Yan, Ji Zeng, Arman Cohan, Quanquan C. Liu
cs.AI

Abstract

As Large Language Models (LLMs) saturate elementary benchmarks, the research frontier has shifted from generation to the reliability of automated evaluation. We demonstrate that standard "LLM-as-a-Judge" protocols suffer from a systematic Alignment Gap when applied to upper-undergraduate to early-graduate-level mathematics. To quantify this, we introduce QEDBench, the first large-scale dual-rubric alignment benchmark that systematically measures alignment with human experts on university-level math proofs by contrasting course-specific rubrics against expert common-knowledge criteria. By deploying a dual-evaluation matrix (7 judges × 5 solvers) against 1,000+ hours of human evaluation, we reveal that certain frontier evaluators, such as Claude Opus 4.5, DeepSeek-V3, Qwen 2.5 Max, and Llama 4 Maverick, exhibit significant positive bias (mean score inflations of +0.18, +0.20, +0.30, and +0.36, respectively). Furthermore, we uncover a critical reasoning gap in the discrete domain: while Gemini 3.0 Pro achieves state-of-the-art performance (0.91 average human evaluation score), other reasoning models such as GPT-5 Pro and Claude Sonnet 4.5 degrade significantly on discrete topics. Specifically, their average human evaluation scores drop to 0.72 and 0.63 in Discrete Math, and to 0.74 and 0.50 in Graph Theory. In addition to these research results, we release QEDBench as a public benchmark for evaluating and improving AI judges; it is publicly available at https://github.com/qqliu/Yale-QEDBench.
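To make the reported bias concrete, below is a minimal sketch of how mean score inflation (the judge's score minus the human expert's score, averaged over proofs graded on a [0, 1] scale) could be computed. The proof IDs, judge names, and numbers are hypothetical placeholders for illustration only, not QEDBench data or the authors' released code.

from statistics import mean

# Hypothetical per-proof scores on a [0, 1] scale (placeholder data,
# standing in for one column of the dual-evaluation matrix plus the
# corresponding human expert grades).
human_scores = {"proof_1": 0.80, "proof_2": 0.55, "proof_3": 0.90}
judge_scores = {
    "judge_A": {"proof_1": 0.95, "proof_2": 0.80, "proof_3": 1.00},
    "judge_B": {"proof_1": 0.85, "proof_2": 0.60, "proof_3": 0.90},
}

def mean_score_inflation(judge, human):
    """Average signed gap (judge score - human score) over proofs graded by both."""
    shared = sorted(judge.keys() & human.keys())
    return mean(judge[p] - human[p] for p in shared)

for name, scores in judge_scores.items():
    print(f"{name}: {mean_score_inflation(scores, human_scores):+.2f}")

A positive value means the judge systematically over-scores proofs relative to human experts; with the placeholder numbers above, judge_A would show roughly +0.17 inflation and judge_B roughly +0.02.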