QEDBENCH: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs

February 24, 2026
Authors: Santiago Gonzalez, Alireza Amiri Bavandpour, Peter Ye, Edward Zhang, Ruslans Aleksejevs, Todor Antić, Polina Baron, Sujeet Bhalerao, Shubhrajit Bhattacharya, Zachary Burton, John Byrne, Hyungjun Choi, Nujhat Ahmed Disha, Koppany István Encz, Yuchen Fang, Robert Joseph George, Ebrahim Ghorbani, Alan Goldfarb, Jing Guo, Meghal Gupta, Stefano Huber, Annika Kanckos, Minjung Kang, Hyun Jong Kim, Dino Lorenzini, Levi Lorenzo, Tianyi Mao, Giovanni Marzenta, Ariane M. Masuda, Lukas Mauth, Ana Mickovic, Andres Miniguano-Trujillo, Antoine Moulin, Wenqi Ni, Tomos Parry, Kevin Ren, Hossein Roodbarani, Mathieu Rundström, Manjil Saikia, Detchat Samart, Rebecca Steiner, Connor Stewart, Dhara Thakkar, Jeffrey Tse, Vasiliki Velona, Yunhai Xiang, Sibel Yalçın, Jun Yan, Ji Zeng, Arman Cohan, Quanquan C. Liu
cs.AI

Abstract

As Large Language Models (LLMs) saturate elementary benchmarks, the research frontier has shifted from generation to the reliability of automated evaluation. We demonstrate that standard "LLM-as-a-Judge" protocols suffer from a systematic Alignment Gap when applied to upper-undergraduate to early-graduate-level mathematics. To quantify this, we introduce QEDBench, the first large-scale dual-rubric alignment benchmark that systematically measures alignment with human experts on university-level math proofs by contrasting course-specific rubrics against expert common-knowledge criteria. By deploying a dual-evaluation matrix (7 judges × 5 solvers) against 1,000+ hours of human evaluation, we reveal that certain frontier evaluators, such as Claude Opus 4.5, DeepSeek-V3, Qwen 2.5 Max, and Llama 4 Maverick, exhibit significant positive bias (mean score inflations of +0.18, +0.20, +0.30, and +0.36, respectively). Furthermore, we uncover a critical reasoning gap in the discrete domain: while Gemini 3.0 Pro achieves state-of-the-art performance (0.91 average human evaluation score), other reasoning models such as GPT-5 Pro and Claude Sonnet 4.5 degrade significantly on discrete topics. Specifically, their average human evaluation scores drop to 0.72 and 0.63 in Discrete Math, and to 0.74 and 0.50 in Graph Theory. In addition to these research results, we release QEDBench as a public benchmark for evaluating and improving AI judges; it is publicly available at https://github.com/qqliu/Yale-QEDBench.
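To make the reported bias concrete, below is a minimal sketch of how mean score inflation (the judge's score minus the human expert's score, averaged over proofs graded on a [0, 1] scale) could be computed. The proof IDs, judge names, and numbers are hypothetical placeholders for illustration only, not QEDBench data or the authors' released code.

from statistics import mean

# Hypothetical per-proof scores on a [0, 1] scale (placeholder data,
# standing in for one column of the dual-evaluation matrix plus the
# corresponding human expert grades).
human_scores = {"proof_1": 0.80, "proof_2": 0.55, "proof_3": 0.90}
judge_scores = {
    "judge_A": {"proof_1": 0.95, "proof_2": 0.80, "proof_3": 1.00},
    "judge_B": {"proof_1": 0.85, "proof_2": 0.60, "proof_3": 0.90},
}

def mean_score_inflation(judge, human):
    """Average signed gap (judge score - human score) over proofs graded by both."""
    shared = sorted(judge.keys() & human.keys())
    return mean(judge[p] - human[p] for p in shared)

for name, scores in judge_scores.items():
    print(f"{name}: {mean_score_inflation(scores, human_scores):+.2f}")

A positive value means the judge systematically over-scores proofs relative to human experts; with the placeholder numbers above, judge_A would show roughly +0.17 inflation and judge_B roughly +0.02.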