Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models
October 10, 2024
Authors: Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shanghaoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, Baobao Chang
cs.AI
Abstract
Recent advancements in large language models (LLMs) have led to significant
breakthroughs in mathematical reasoning capabilities. However, existing
benchmarks like GSM8K or MATH are now being solved with high accuracy (e.g.,
OpenAI o1 achieves 94.8% on MATH dataset), indicating their inadequacy for
truly challenging these models. To bridge this gap, we propose a comprehensive
and challenging benchmark specifically designed to assess LLMs' mathematical
reasoning at the Olympiad level. Unlike existing Olympiad-related benchmarks,
our dataset focuses exclusively on mathematics and comprises a vast collection
of 4428 competition-level problems with rigorous human annotation. These
problems are meticulously categorized into over 33 sub-domains and span more
than 10 distinct difficulty levels, enabling a holistic assessment of model
performance in Olympiad-mathematical reasoning. Furthermore, we conducted an
in-depth analysis based on this benchmark. Our experimental results show that
even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle
with highly challenging Olympiad-level problems, achieving only 60.54% and
52.55% accuracy, respectively, highlighting significant challenges in
Olympiad-level mathematical reasoning.