Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models
March 27, 2025
Authors: Haoxiang Sun, Yingqian Min, Zhipeng Chen, Wayne Xin Zhao, Zheng Liu, Zhongyuan Wang, Lei Fang, Ji-Rong Wen
cs.AI
Abstract
In recent years, the rapid development of large reasoning models has resulted
in the saturation of existing benchmarks for evaluating mathematical reasoning,
highlighting the urgent need for more challenging and rigorous evaluation
frameworks. To address this gap, we introduce OlymMATH, a novel Olympiad-level
mathematical benchmark, designed to rigorously test the complex reasoning
capabilities of LLMs. OlymMATH features 200 meticulously curated problems, each
manually verified and available in parallel English and Chinese versions. The
problems are systematically organized into two distinct difficulty tiers: (1)
AIME-level problems (easy) that establish a baseline for mathematical reasoning
assessment, and (2) significantly more challenging problems (hard) designed to
push the boundaries of current state-of-the-art models. The problems span four
core mathematical fields, and each problem has a verifiable numerical solution
that enables objective, rule-based evaluation. Empirical
results underscore the significant challenge presented by OlymMATH, with
state-of-the-art models including DeepSeek-R1 and OpenAI's o3-mini
demonstrating notably limited accuracy on the hard subset. Furthermore, the
benchmark facilitates comprehensive bilingual assessment of mathematical
reasoning abilities, a critical dimension that remains largely unaddressed in
mainstream mathematical reasoning benchmarks. We release the OlymMATH benchmark
at the STILL project: https://github.com/RUCAIBox/Slow_Thinking_with_LLMs.