
AMO-Bench: Large Language Models Still Struggle in High School Math Competitions

October 30, 2025
Authors: Shengnan An, Xunliang Cai, Xuezhi Cao, Xiaoyu Li, Yehao Lin, Junlin Liu, Xinxuan Lv, Dan Ma, Xuanlin Wang, Ziwen Wang, Shuang Zhou
cs.AI

Abstract

We present AMO-Bench, an Advanced Mathematical reasoning benchmark of Olympiad-level or even higher difficulty, comprising 50 human-crafted problems. Existing benchmarks have widely leveraged high school math competitions to evaluate the mathematical reasoning capabilities of large language models (LLMs). However, many of these competitions are becoming less effective for assessing top-tier LLMs due to performance saturation (e.g., AIME24/25). To address this, AMO-Bench introduces more rigorous challenges by ensuring that all 50 problems are (1) cross-validated by experts to meet at least the International Mathematical Olympiad (IMO) difficulty standard, and (2) entirely original, preventing performance leakage from data memorization. Moreover, each problem in AMO-Bench requires only a final answer rather than a proof, enabling automatic and robust grading. Experimental results across 26 LLMs show that even the best-performing model achieves only 52.4% accuracy on AMO-Bench, with most LLMs scoring below 40%. Beyond these low scores, our further analysis reveals a promising scaling trend with increasing test-time compute on AMO-Bench. These results highlight the significant room for improving the mathematical reasoning of current LLMs. We release AMO-Bench to facilitate further research into advancing the reasoning abilities of language models. https://amo-bench.github.io/
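
The abstract notes that every AMO-Bench problem asks for a final answer rather than a proof, and that accuracy improves as test-time compute grows. Below is a minimal, illustrative sketch of what final-answer grading and a test-time-compute sweep could look like; it is not the authors' released evaluation harness, and the function names (normalize, grade, majority_vote_accuracy) and toy data are assumptions for illustration only.

```python
from collections import Counter
from fractions import Fraction


def normalize(answer: str) -> str:
    """Canonicalize a final-answer string so superficially different forms compare equal."""
    text = answer.strip().rstrip(".").replace(" ", "")
    try:
        # Exact rational arithmetic lets "0.5", " 1/2 ", and "1/2." all grade as the same answer.
        return str(Fraction(text))
    except (ValueError, ZeroDivisionError):
        return text.lower()


def grade(predicted: str, reference: str) -> bool:
    """Automatic grading: True if the model's final answer matches the reference answer."""
    return normalize(predicted) == normalize(reference)


def majority_vote_accuracy(samples: dict[str, list[str]],
                           references: dict[str, str],
                           k: int) -> float:
    """Accuracy when each problem is answered by majority vote over its first k sampled solutions."""
    correct = 0
    for pid, answers in samples.items():
        voted, _ = Counter(normalize(a) for a in answers[:k]).most_common(1)[0]
        correct += grade(voted, references[pid])
    return correct / len(references)


if __name__ == "__main__":
    # Toy stand-ins for model outputs; a real run would load the 50 problems and many samples each.
    refs = {"p1": "1/2", "p2": "42"}
    outs = {"p1": ["0.5", "3", "1/2", "1/2"], "p2": ["41", "42", "42", "6*7"]}
    for k in (1, 2, 4):
        print(f"k={k}: accuracy = {majority_vote_accuracy(outs, refs, k):.2f}")
```

Increasing k in the sweep is one simple proxy for the test-time-compute scaling the abstract describes; exact rational comparison in normalize is a design choice that avoids floating-point mismatches between equivalent answers.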