QEDBENCH: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs

Abstract

As Large Language Models (LLMs) saturate elementary benchmarks, the research frontier has shifted from generation to the reliability of automated evaluation. We demonstrate that standard "LLM-as-a-Judge" protocols suffer from a systematic Alignment Gap when applied to upper-undergraduate to early-graduate-level mathematics. To quantify this gap, we introduce QEDBench, the first large-scale dual-rubric alignment benchmark that systematically measures alignment with human experts on university-level math proofs by contrasting course-specific rubrics against expert common-knowledge criteria. By deploying a dual-evaluation matrix (7 judges × 5 solvers) against 1,000+ hours of human evaluation, we reveal that certain frontier evaluators, such as Claude Opus 4.5, DeepSeek-V3, Qwen 2.5 Max, and Llama 4 Maverick, exhibit significant positive bias (up to +0.18, +0.20, +0.30, and +0.36 mean score inflation, respectively). Furthermore, we uncover a critical reasoning gap in the discrete domain: while Gemini 3.0 Pro achieves state-of-the-art performance (0.91 average human evaluation score), the performance of other reasoning models such as GPT-5 Pro and Claude Sonnet 4.5 degrades significantly in discrete domains. Specifically, their average human evaluation scores drop to 0.72 and 0.63, respectively, in Discrete Math, and to 0.74 and 0.50 in Graph Theory. Alongside these findings, we release QEDBench as a public benchmark for evaluating and improving AI judges. The benchmark is publicly available at https://github.com/qqliu/Yale-QEDBench.
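The abstract reports judge bias as "mean score inflation" but does not spell out the formula. One plausible reading, assumed here and not confirmed by the source, is the mean signed difference between a judge's score and the human score on the same proofs. The minimal sketch below illustrates that reading; the function name `mean_score_inflation` and the toy scores are hypothetical and are not taken from QEDBench.

```python
from statistics import mean


def mean_score_inflation(judge_scores, human_scores):
    """Mean signed difference (judge minus human) over paired proof scores.

    Assumption: both lists hold scores on the same scale (e.g. [0, 1]) and
    are aligned so that index i refers to the same proof in both lists.
    A positive result means the judge scores higher than humans on average.
    """
    if len(judge_scores) != len(human_scores):
        raise ValueError("score lists must be paired one-to-one")
    if not judge_scores:
        raise ValueError("need at least one scored proof")
    return mean(j - h for j, h in zip(judge_scores, human_scores))


# Toy, made-up scores for four proofs; illustrative only.
human = [0.5, 1.0, 0.0, 0.75]
judge = [0.75, 1.0, 0.25, 1.0]
bias = mean_score_inflation(judge, human)
print(f"mean score inflation: {bias:+.4f}")
```

Under this reading, a value near zero would indicate a judge that is neither lenient nor harsh on average, although it could still disagree with humans on individual proofs.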
