Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models
March 27, 2025
Authors: Haoxiang Sun, Yingqian Min, Zhipeng Chen, Wayne Xin Zhao, Zheng Liu, Zhongyuan Wang, Lei Fang, Ji-Rong Wen
cs.AI
Abstract
In recent years, the rapid development of large reasoning models has saturated existing benchmarks for evaluating mathematical reasoning, highlighting the urgent need for more challenging and rigorous evaluation frameworks. To address this gap, we introduce OlymMATH, a novel Olympiad-level mathematical benchmark designed to rigorously test the complex reasoning capabilities of LLMs. OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions. The problems are systematically organized into two difficulty tiers: (1) AIME-level problems (easy), which establish a baseline for mathematical reasoning assessment, and (2) significantly more challenging problems (hard), designed to push the boundaries of current state-of-the-art models. The problems span four core mathematical fields, and each includes a verifiable numerical solution to enable objective, rule-based evaluation. Empirical results underscore the significant challenge posed by OlymMATH: state-of-the-art models, including DeepSeek-R1 and OpenAI's o3-mini, achieve notably limited accuracy on the hard subset. Furthermore, the benchmark enables comprehensive bilingual assessment of mathematical reasoning abilities, a critical dimension that remains largely unaddressed in mainstream mathematical reasoning benchmarks. We release the OlymMATH benchmark as part of the STILL project: https://github.com/RUCAIBox/Slow_Thinking_with_LLMs
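To illustrate the objective, rule-based evaluation the abstract describes, here is a minimal sketch of how a response might be checked against a problem's verifiable numerical solution. The function names, the answer-extraction regex, and the tolerance are assumptions for illustration only, not the project's actual scoring code.

```python
import re
from typing import Optional

# Illustrative sketch only: the real OlymMATH pipeline may parse a different
# answer format (e.g., \boxed{} expressions) and use different matching rules.

def extract_final_number(response: str) -> Optional[float]:
    """Return the last numeric token in a model response, if any."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", response)
    return float(matches[-1]) if matches else None

def is_correct(response: str, ground_truth: float, tol: float = 1e-6) -> bool:
    """Rule-based check: the extracted answer must match the verified solution."""
    predicted = extract_final_number(response)
    return predicted is not None and abs(predicted - ground_truth) <= tol

# Example usage
print(is_correct("... so the answer is 42.", 42.0))          # True
print(is_correct("The result is approximately 41.9", 42.0))  # False
```

Because every problem carries a single numerical ground truth, a check of this kind avoids subjective grading and makes results directly comparable across models.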