

MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

April 20, 2026
作者: Shaden Alshammari, Kevin Wen, Abrar Zainal, Mark Hamilton, Navid Safaei, Sultan Albarakati, William T. Freeman, Antonio Torralba
cs.AI

Abstract

Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. MathNet spans 47 countries, 17 languages, and two decades of competitions, comprising 30,676 expert-authored problems with solutions across diverse domains. In addition to the core dataset, we construct a retrieval benchmark consisting of mathematically equivalent and structurally similar problem pairs curated by human experts. MathNet supports three tasks: (i) Problem Solving, (ii) Math-Aware Retrieval, and (iii) Retrieval-Augmented Problem Solving. Experimental results show that even state-of-the-art reasoning models (78.4% for Gemini-3.1-Pro and 69.3% for GPT-5) remain challenged, while embedding models struggle to retrieve equivalent problems. We further show that retrieval-augmented generation performance is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark. MathNet provides the largest high-quality Olympiad dataset together with the first benchmark for evaluating mathematical problem retrieval, and we publicly release both the dataset and benchmark at https://mathnet.mit.edu.
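The Math-Aware Retrieval task described above can be sketched in a few lines: embed each problem, rank the corpus by cosine similarity to a query problem, and hand the top-k hits to a solver as context for the retrieval-augmented setting. The sketch below uses random vectors as stand-ins for a real embedding model's output; the function and variable names are illustrative and not taken from the MathNet release.

```python
import numpy as np

def cosine_topk(query_vec, corpus_vecs, k=3):
    """Return indices of the k corpus vectors most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = c @ q                      # cosine similarity against every problem
    return np.argsort(-sims)[:k]      # indices of the k best matches

rng = np.random.default_rng(0)
corpus = rng.normal(size=(100, 64))               # toy embeddings for 100 problems
query = corpus[42] + 0.01 * rng.normal(size=64)   # near-duplicate of problem 42

hits = cosine_topk(query, corpus, k=3)
# Problem 42 should rank first; the retrieved problems would then be
# prepended to the solver's prompt in the retrieval-augmented setting.
```

The paper's finding that RAG gains are highly sensitive to retrieval quality corresponds here to how reliably the true equivalent problem lands in the top-k: a near-duplicate query recovers it easily, while genuinely equivalent but differently worded problems are exactly where the evaluated embedding models struggle.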