
Towards Robust Mathematical Reasoning

November 3, 2025
Authors: Thang Luong, Dawsen Hwang, Hoang H. Nguyen, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Junsu Kim, Garrett Bingham, Jonathan Lee, Swaroop Mishra, Alex Zhai, Clara Huiyi Hu, Henryk Michalewski, Jimin Kim, Jeonghyun Ahn, Junhwi Bae, Xingyou Song, Trieu H. Trinh, Quoc V. Le, Junehyuk Jung
cs.AI

Abstract

Finding the right north-star metrics is critical for advancing the mathematical reasoning capabilities of foundation models, especially given that existing evaluations are either too easy or focus only on getting correct short answers. To address these issues, we present IMO-Bench, a suite of advanced reasoning benchmarks vetted by a panel of top specialists that specifically targets the level of the International Mathematical Olympiad (IMO), the most prestigious venue for young mathematicians. IMO-AnswerBench first tests models on 400 diverse Olympiad problems with verifiable short answers. IMO-ProofBench is the next-level evaluation of proof-writing capabilities, comprising both basic and advanced IMO-level problems together with detailed grading guidelines to facilitate automatic grading. These benchmarks played a crucial role in our historic gold-level performance at IMO 2025 with Gemini Deep Think (Luong and Lockhart, 2025). Our model achieved 80.0% on IMO-AnswerBench and 65.7% on the advanced IMO-ProofBench, surpassing the best non-Gemini models by large margins of 6.9% and 42.4%, respectively. We also show that autograders built with Gemini reasoning correlate well with human evaluations, and we construct IMO-GradingBench, with 1000 human gradings of proofs, to enable further progress in the automatic evaluation of long-form answers. We hope that IMO-Bench will help the community advance robust mathematical reasoning; we release it at https://imobench.github.io/.
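To make the short-answer evaluation concrete, the sketch below scores a solver against a set of problems with verifiable short answers, in the spirit of IMO-AnswerBench. It is a minimal illustration only: the JSON layout, the field names (`problem`, `answer`), the file name, and the `normalize` heuristic are assumptions for this example, not the released benchmark format or grading protocol.

```python
# Minimal sketch: accuracy of a solver on short-answer benchmark items.
# The data schema and normalization rule here are illustrative assumptions,
# not the official IMO-Bench format.
import json
from typing import Callable


def normalize(ans: str) -> str:
    """Crude normalization so that, e.g., ' 2025 ' and '2025' compare equal."""
    return ans.strip().lower().replace(" ", "")


def score_short_answers(path: str, solve: Callable[[str], str]) -> float:
    """Return the fraction of items whose solver output matches the reference answer."""
    with open(path) as f:
        items = json.load(f)  # expected (assumed) form: [{"problem": ..., "answer": ...}, ...]
    correct = sum(
        normalize(solve(item["problem"])) == normalize(item["answer"])
        for item in items
    )
    return correct / len(items)


if __name__ == "__main__":
    # Placeholder solver; in practice this would call a reasoning model.
    dummy_solver = lambda problem: "42"
    print(f"accuracy: {score_short_answers('answerbench.json', dummy_solver):.1%}")
```

A grading harness for the proof benchmarks would differ substantially, since long-form proofs require rubric-based judgments (human or autograder) rather than string matching.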