

Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

October 10, 2024
Authors: Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shanghaoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, Baobao Chang
cs.AI

Abstract

Recent advancements in large language models (LLMs) have led to significant breakthroughs in mathematical reasoning capabilities. However, existing benchmarks like GSM8K or MATH are now solved with high accuracy (e.g., OpenAI o1 achieves 94.8% on the MATH dataset), indicating that they no longer pose a genuine challenge to these models. To bridge this gap, we propose a comprehensive and challenging benchmark specifically designed to assess LLMs' mathematical reasoning at the Olympiad level. Unlike existing Olympiad-related benchmarks, our dataset focuses exclusively on mathematics and comprises 4428 competition-level problems with rigorous human annotation. These problems are meticulously categorized into over 33 sub-domains and span more than 10 distinct difficulty levels, enabling a holistic assessment of model performance in Olympiad-level mathematical reasoning. Furthermore, we conduct an in-depth analysis based on this benchmark. Our experimental results show that even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle with highly challenging Olympiad-level problems, achieving accuracies of only 60.54% and 52.55% respectively, which highlights the significant challenges that remain in Olympiad-level mathematical reasoning.
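
Because the benchmark is organized by sub-domain and difficulty level, a natural way to use it is to report accuracy per difficulty bucket rather than a single aggregate score. The Python sketch below illustrates that idea; the dataset identifier, split name, and field names (`problem`, `answer`, `difficulty`) are assumptions made for illustration, not the paper's official loader or schema, and the exact-match check is only a stand-in for whatever grading protocol the benchmark actually uses.

```python
# Minimal sketch: per-difficulty accuracy on an Omni-MATH-style benchmark.
# Dataset ID, split name, and field names are illustrative assumptions,
# not the paper's official loader or schema.
from collections import defaultdict

from datasets import load_dataset  # pip install datasets


def solve(problem: str) -> str:
    """Placeholder for a call to the model under evaluation."""
    raise NotImplementedError


def per_difficulty_accuracy(dataset_id: str = "your-org/omni-math") -> dict:
    data = load_dataset(dataset_id, split="test")   # assumed split name
    correct, total = defaultdict(int), defaultdict(int)
    for example in data:
        level = example["difficulty"]                # assumed field name
        prediction = solve(example["problem"])       # assumed field name
        total[level] += 1
        # Naive exact match; a real harness would normalize answers first.
        if prediction.strip() == str(example["answer"]).strip():
            correct[level] += 1
    return {level: correct[level] / total[level] for level in sorted(total)}
```

A breakdown like this is what makes the 10+ difficulty levels useful: it shows where a model's accuracy collapses instead of averaging that information away.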

