Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models

Abstract

In recent years, the rapid development of large reasoning models has saturated existing benchmarks for evaluating mathematical reasoning, highlighting the urgent need for more challenging and rigorous evaluation frameworks. To address this gap, we introduce OlymMATH, a novel Olympiad-level mathematical benchmark designed to rigorously test the complex reasoning capabilities of LLMs. OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions. The problems are systematically organized into two distinct difficulty tiers: (1) AIME-level problems (easy) that establish a baseline for mathematical reasoning assessment, and (2) significantly more challenging problems (hard) designed to push the boundaries of current state-of-the-art models. The problems span four core mathematical fields, and each problem includes a verifiable numerical solution to enable objective, rule-based evaluation. Empirical results underscore the significant challenge presented by OlymMATH: state-of-the-art models, including DeepSeek-R1 and OpenAI's o3-mini, demonstrate notably limited accuracy on the hard subset. Furthermore, the benchmark facilitates comprehensive bilingual assessment of mathematical reasoning abilities, a critical dimension that remains largely unaddressed in mainstream mathematical reasoning benchmarks. We release the OlymMATH benchmark at the STILL project: https://github.com/RUCAIBox/Slow_Thinking_with_LLMs.
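As an illustration of what rule-based evaluation against a verifiable numerical solution could look like, the following is a minimal sketch and not the benchmark's actual evaluator. It assumes answers are plain integers, decimals, or `a/b` fractions; the function names `parse_numeric`, `is_correct`, and `accuracy` are hypothetical. Answers that contain radicals or other symbolic forms would need a symbolic comparison that this sketch does not attempt.

```python
from fractions import Fraction


def parse_numeric(text: str) -> Fraction:
    """Parse an integer, decimal, or 'a/b' string into an exact Fraction.

    Assumption: answers are purely rational; symbolic answers are out of scope.
    """
    return Fraction(text.strip().replace(" ", ""))


def is_correct(predicted: str, reference: str) -> bool:
    """Return True when both strings denote the same rational number."""
    try:
        return parse_numeric(predicted) == parse_numeric(reference)
    except (ValueError, ZeroDivisionError):
        # Unparseable or malformed answers are counted as incorrect.
        return False


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that match their reference answer exactly."""
    pairs = list(zip(predictions, references))
    if not pairs:
        return 0.0
    return sum(is_correct(p, r) for p, r in pairs) / len(pairs)
```

Comparing exact fractions rather than floats avoids tolerance choices: under this sketch, `0.75` and `3/4` are treated as the same answer, while any unparseable output is simply scored as wrong.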
