Putnam-AXIOM: A Functional and Static Benchmark
August 5, 2025
Authors: Aryan Gulati, Brando Miranda, Eric Chen, Emily Xia, Kai Fronsdal, Bruno Dumont, Elyas Obbad, Sanmi Koyejo
cs.AI
Abstract
Current mathematical reasoning benchmarks for large language models (LLMs)
are approaching saturation, with some achieving > 90% accuracy, and are
increasingly compromised by training-set contamination. We introduce
Putnam-AXIOM, a benchmark of 522 university-level competition problems drawn
from the prestigious William Lowell Putnam Mathematical Competition, and
Putnam-AXIOM Variation, an unseen companion set of 100 functional variants
generated by programmatically perturbing variables and constants. The variation
protocol produces an unlimited stream of equally difficult, unseen instances --
yielding a contamination-resilient test bed. On the Original set, OpenAI's
o1-preview -- the strongest evaluated model -- scores 41.9%, but its accuracy
drops by 19.6 percentage points (a 46.8% relative decrease) on the paired
Variations. The
remaining eighteen models show the same downward trend, ten of them with
non-overlapping 95% confidence intervals. These gaps suggest memorization and
highlight the necessity of dynamic benchmarks. We complement "boxed" accuracy
with Teacher-Forced Accuracy (TFA), a lightweight metric that directly scores
reasoning traces and automates natural language proof evaluations. Putnam-AXIOM
therefore provides a rigorous, contamination-resilient evaluation framework for
assessing advanced mathematical reasoning of LLMs. Data and evaluation code are
publicly available at https://github.com/brando90/putnam-axiom.
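The variation protocol described in the abstract rewrites each problem as a template whose variables and constants are perturbed programmatically, so every render is an unseen but comparably difficult instance. The sketch below is a minimal illustration of that idea with a hypothetical problem template; it is not the authors' released code.

```python
# Minimal sketch of a functional variation: constants in a problem template are
# re-sampled at render time and the ground-truth answer is recomputed, so each
# seed yields a fresh, unseen variant of the same underlying problem.
import random

def render_variant(template, answer_fn, seed=None):
    """Sample new constants, substitute them into the statement, and
    recompute the reference answer with `answer_fn`."""
    rng = random.Random(seed)
    a = rng.randint(2, 9)  # perturbed constant
    b = rng.randint(2, 9)  # perturbed constant
    return template.format(a=a, b=b), answer_fn(a, b)

# Hypothetical problem in the spirit of the benchmark (not an actual Putnam item).
template = ("Evaluate the sum of the arithmetic progression "
            "{a}, {a} + {b}, {a} + 2*{b}, ..., {a} + 99*{b}.")
answer_fn = lambda a, b: 100 * a + b * (99 * 100 // 2)

problem, answer = render_variant(template, answer_fn, seed=0)
print(problem)
print("ground truth:", answer)
```

For scoring, the abstract contrasts "boxed" exact-match accuracy with Teacher-Forced Accuracy (TFA), which scores the reasoning trace itself. The following sketch is one plausible reading of those two metrics, assuming a Hugging Face causal language model and tokenizer; it is illustrative rather than the paper's exact evaluation code.

```python
# Hedged sketch: "boxed" accuracy checks only the final \boxed{...} answer,
# while TFA measures how many tokens of the reference solution the model
# predicts correctly under teacher forcing.
import re
import torch

def boxed_accuracy(model_output, ground_truth):
    """Exact match on the last \\boxed{...} expression in the model output."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", model_output)
    return bool(matches) and matches[-1].strip() == ground_truth.strip()

def teacher_forced_accuracy(model, tokenizer, problem, reference_solution):
    """Fraction of reference-solution tokens predicted correctly when the
    model is conditioned on the problem and the preceding reference tokens."""
    prompt_ids = tokenizer(problem, return_tensors="pt").input_ids
    target_ids = tokenizer(reference_solution, return_tensors="pt",
                           add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position t predict token t + 1, so the slice below covers
    # exactly the reference-solution span.
    preds = logits[0, prompt_ids.shape[1] - 1 : -1].argmax(dim=-1)
    return (preds == target_ids[0]).float().mean().item()

# Usage with any Hugging Face causal LM pair, e.g.:
#   tok = AutoTokenizer.from_pretrained(...); lm = AutoModelForCausalLM.from_pretrained(...)
#   tfa = teacher_forced_accuracy(lm, tok, problem, reference_solution)
```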