MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
April 20, 2026
作者: Shaden Alshammari, Kevin Wen, Abrar Zainal, Mark Hamilton, Navid Safaei, Sultan Albarakati, William T. Freeman, Antonio Torralba
cs.AI
Abstract
Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. MathNet spans 47 countries, 17 languages, and two decades of competitions, comprising 30,676 expert-authored problems with solutions across diverse domains. In addition to the core dataset, we construct a retrieval benchmark consisting of mathematically equivalent and structurally similar problem pairs curated by human experts.
MathNet supports three tasks: (i) Problem Solving, (ii) Math-Aware Retrieval, and (iii) Retrieval-Augmented Problem Solving. Experimental results show that even state-of-the-art reasoning models (78.4% for Gemini-3.1-Pro and 69.3% for GPT-5) remain challenged, while embedding models struggle to retrieve equivalent problems. We further show that retrieval-augmented generation performance is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark. MathNet provides the largest high-quality Olympiad dataset together with the first benchmark for evaluating mathematical problem retrieval, and we publicly release both the dataset and benchmark at https://mathnet.mit.edu.
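The Math-Aware Retrieval task evaluated above is, at its core, embedding-based nearest-neighbor search: given a query problem, rank a corpus of problems so that mathematically equivalent ones surface first. The following minimal sketch illustrates that setup with cosine similarity; the random vectors stand in for real problem embeddings, and nothing here reflects MathNet's actual retrieval pipeline or any particular embedding model.

```python
import numpy as np

# Placeholder embeddings: 5 candidate problems in an 8-dim space.
# In practice these would come from a text-embedding model run on
# the problem statements; random vectors are used purely for illustration.
rng = np.random.default_rng(0)
corpus_embeddings = rng.normal(size=(5, 8))

# A query that is a near-duplicate of problem 2 (an "equivalent" pair),
# simulated by adding small noise to its embedding.
query_embedding = corpus_embeddings[2] + 0.01 * rng.normal(size=8)

def cosine_rank(query, corpus):
    """Rank corpus rows by cosine similarity to the query, highest first."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q
    return np.argsort(-scores), scores

ranking, scores = cosine_rank(query_embedding, corpus_embeddings)
print(ranking[0])  # index of the top-ranked (most similar) problem
```

A retrieval benchmark like MathNet's then checks whether the expert-labeled equivalent problem appears at the top of this ranking (e.g., via recall@k or MRR); the paper's finding is that current embedding models often fail this test on Olympiad problems even when a simple pipeline like the above is wired to strong encoders.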