Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models
March 27, 2025
Authors: Haoxiang Sun, Yingqian Min, Zhipeng Chen, Wayne Xin Zhao, Zheng Liu, Zhongyuan Wang, Lei Fang, Ji-Rong Wen
cs.AI
Abstract
In recent years, the rapid development of large reasoning models has saturated existing benchmarks for evaluating mathematical reasoning, highlighting the urgent need for more challenging and rigorous evaluation frameworks. To address this gap, we introduce OlymMATH, a novel Olympiad-level mathematical benchmark designed to rigorously test the complex reasoning capabilities of LLMs. OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions. The problems are systematically organized into two difficulty tiers: (1) AIME-level problems (easy), which establish a baseline for mathematical reasoning assessment, and (2) significantly more challenging problems (hard), designed to push the boundaries of current state-of-the-art models. The problems span four core mathematical fields, and each includes a verifiable numerical solution to enable objective, rule-based evaluation. Empirical results underscore the significant challenge posed by OlymMATH: state-of-the-art models, including DeepSeek-R1 and OpenAI's o3-mini, achieve notably limited accuracy on the hard subset. Furthermore, the benchmark enables comprehensive bilingual assessment of mathematical reasoning abilities, a critical dimension that remains largely unaddressed in mainstream mathematical reasoning benchmarks. We release the OlymMATH benchmark as part of the STILL project: https://github.com/RUCAIBox/Slow_Thinking_with_LLMs
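To illustrate the objective, rule-based evaluation the abstract describes, here is a minimal sketch of how a response might be checked against a problem's verifiable numerical solution. The function names, the answer-extraction regex, and the tolerance are assumptions for illustration only, not the project's actual scoring code.

```python
import re
from typing import Optional

# Illustrative sketch only: the real OlymMATH pipeline may parse a different
# answer format (e.g., \boxed{} expressions) and use different matching rules.

def extract_final_number(response: str) -> Optional[float]:
    """Return the last numeric token in a model response, if any."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", response)
    return float(matches[-1]) if matches else None

def is_correct(response: str, ground_truth: float, tol: float = 1e-6) -> bool:
    """Rule-based check: the extracted answer must match the verified solution."""
    predicted = extract_final_number(response)
    return predicted is not None and abs(predicted - ground_truth) <= tol

# Example usage
print(is_correct("... so the answer is 42.", 42.0))          # True
print(is_correct("The result is approximately 41.9", 42.0))  # False
```

Because every problem carries a single numerical ground truth, a check of this kind avoids subjective grading and makes results directly comparable across models.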