Putnam-AXIOM: A Functional and Static Benchmark
August 5, 2025
Authors: Aryan Gulati, Brando Miranda, Eric Chen, Emily Xia, Kai Fronsdal, Bruno Dumont, Elyas Obbad, Sanmi Koyejo
cs.AI
Abstract
Current mathematical reasoning benchmarks for large language models (LLMs)
are approaching saturation, with some achieving > 90% accuracy, and are
increasingly compromised by training-set contamination. We introduce
Putnam-AXIOM, a benchmark of 522 university-level competition problems drawn
from the prestigious William Lowell Putnam Mathematical Competition, and
Putnam-AXIOM Variation, an unseen companion set of 100 functional variants
generated by programmatically perturbing variables and constants. The variation
protocol produces an unlimited stream of equally difficult, unseen instances --
yielding a contamination-resilient test bed. On the Original set, OpenAI's
o1-preview -- the strongest evaluated model -- scores 41.9%, but its accuracy
drops by 19.6 percentage points (a 46.8% relative decrease) on the paired Variations. The
remaining eighteen models show the same downward trend, ten of them with
non-overlapping 95% confidence intervals. These gaps suggest memorization and
highlight the necessity of dynamic benchmarks. We complement "boxed" accuracy
with Teacher-Forced Accuracy (TFA), a lightweight metric that directly scores
reasoning traces and automates natural language proof evaluations. Putnam-AXIOM
therefore provides a rigorous, contamination-resilient evaluation framework for
assessing advanced mathematical reasoning of LLMs. Data and evaluation code are
publicly available at https://github.com/brando90/putnam-axiom.
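
The abstract describes the variation protocol only at a high level. As a loose illustration of the idea, the sketch below treats a problem as a template whose constants are re-sampled and whose ground-truth answer is recomputed programmatically; the problem, function name, and perturbation scheme are invented for this example and are not the paper's actual generator.

```python
import random

def make_variant(seed: int) -> dict:
    """Hypothetical functional variant: perturb a templated problem's
    constants and recompute its ground-truth answer programmatically."""
    rng = random.Random(seed)
    a = rng.randint(2, 9)   # perturbed constant
    n = rng.randint(3, 7)   # perturbed exponent
    problem = f"Find the sum of the coefficients of (x + {a})^{n}."
    answer = (1 + a) ** n    # sum of coefficients = polynomial evaluated at x = 1
    return {"problem": problem, "answer": answer}

print(make_variant(0))  # a different but comparably difficult instance per seed
```

Similarly, TFA is named but not defined in the abstract. One plausible reading -- an assumption, not the paper's exact metric -- is token-level next-token accuracy on the reference reasoning trace under teacher forcing, which could be computed roughly as follows (model name and prompt format are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def teacher_forced_accuracy(model, tokenizer, prompt: str, reference: str) -> float:
    """Fraction of reference-trace tokens the model predicts correctly when
    conditioned on the gold prefix (teacher forcing)."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + reference, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    preds = logits[:, :-1].argmax(dim=-1)  # prediction for token i comes from position i-1
    targets = full_ids[:, 1:]
    start = prompt_len - 1                 # score only the reference span
    return (preds[:, start:] == targets[:, start:]).float().mean().item()

# Usage sketch:
# tok = AutoTokenizer.from_pretrained("gpt2")
# mdl = AutoModelForCausalLM.from_pretrained("gpt2")
# print(teacher_forced_accuracy(mdl, tok, "Problem: ...\nSolution: ", "..."))
```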