QEDBENCH: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs
February 24, 2026
Authors: Santiago Gonzalez, Alireza Amiri Bavandpour, Peter Ye, Edward Zhang, Ruslans Aleksejevs, Todor Antić, Polina Baron, Sujeet Bhalerao, Shubhrajit Bhattacharya, Zachary Burton, John Byrne, Hyungjun Choi, Nujhat Ahmed Disha, Koppany István Encz, Yuchen Fang, Robert Joseph George, Ebrahim Ghorbani, Alan Goldfarb, Jing Guo, Meghal Gupta, Stefano Huber, Annika Kanckos, Minjung Kang, Hyun Jong Kim, Dino Lorenzini, Levi Lorenzo, Tianyi Mao, Giovanni Marzenta, Ariane M. Masuda, Lukas Mauth, Ana Mickovic, Andres Miniguano-Trujillo, Antoine Moulin, Wenqi Ni, Tomos Parry, Kevin Ren, Hossein Roodbarani, Mathieu Rundström, Manjil Saikia, Detchat Samart, Rebecca Steiner, Connor Stewart, Dhara Thakkar, Jeffrey Tse, Vasiliki Velona, Yunhai Xiang, Sibel Yalçın, Jun Yan, Ji Zeng, Arman Cohan, Quanquan C. Liu
cs.AI
Abstract
As Large Language Models (LLMs) saturate elementary benchmarks, the research frontier has shifted from generation to the reliability of automated evaluation. We demonstrate that standard "LLM-as-a-Judge" protocols suffer from a systematic Alignment Gap when applied to upper-undergraduate through early-graduate mathematics. To quantify this, we introduce QEDBench, the first large-scale dual-rubric alignment benchmark that systematically measures alignment with human experts on university-level math proofs by contrasting course-specific rubrics against expert common-knowledge criteria. By deploying a dual-evaluation matrix (7 judges × 5 solvers) against 1,000+ hours of human evaluation, we reveal that certain frontier evaluators, including Claude Opus 4.5, DeepSeek-V3, Qwen 2.5 Max, and Llama 4 Maverick, exhibit significant positive bias (+0.18, +0.20, +0.30, and +0.36 mean score inflation, respectively). Furthermore, we uncover a critical reasoning gap in the discrete domain: while Gemini 3.0 Pro achieves state-of-the-art performance (0.91 average human-evaluation score), other reasoning models such as GPT-5 Pro and Claude Sonnet 4.5 degrade significantly, with average human-evaluation scores dropping to 0.72 and 0.63 in Discrete Math, and to 0.74 and 0.50 in Graph Theory. In addition to these research results, we release QEDBench as a public benchmark for evaluating and improving AI judges; it is publicly available at https://github.com/qqliu/Yale-QEDBench.
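The inflation figures above are per-judge averages of the gap between a judge's score and the expert score on the same proof. The following minimal sketch shows how such a metric could be computed from a judges × proofs score matrix; the array names and the synthetic scores are hypothetical, not drawn from the QEDBench release, and the only assumption is that judge and human scores share the same 0-1 scale used in the reported numbers.

```python
import numpy as np

# Hypothetical data for illustration only: 7 judge models grading 100 proofs,
# with a consensus expert score per proof. Shapes and names are assumptions.
rng = np.random.default_rng(0)
n_judges, n_proofs = 7, 100
human_scores = rng.uniform(0.3, 1.0, size=n_proofs)          # expert score per proof
judge_scores = np.clip(                                       # judge scores, noisy and biased upward
    human_scores + rng.normal(0.1, 0.15, size=(n_judges, n_proofs)),
    0.0, 1.0,
)

# Mean score inflation per judge: the average (judge - human) gap over all proofs.
# A positive value means the judge systematically over-grades, in the spirit of
# the +0.18 to +0.36 inflation figures quoted in the abstract.
inflation = (judge_scores - human_scores).mean(axis=1)
for j, delta in enumerate(inflation):
    print(f"judge {j}: mean score inflation {delta:+.2f}")
```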