ResearchMath-14K: Scaling Research-Level Mathematics via Agents

Abstract

The frontier of mathematics is defined by problems whose solutions are not yet known, yet it remains unclear whether language models can meaningfully engage with such problems without human intervention. A major obstacle is the lack of large-scale research-level math datasets. To this end, we introduce ResearchMath-14k, a set of 14,056 problems curated from academic sources via a multi-agent pipeline, making it the largest collection of research-level mathematical problems to date. We further generate ResearchMath-Reasoning, 220K teacher trajectories from two open models, where we observe recurring avoidance behaviors such as non-attempts and fabricated references. Interestingly, across eight open-weight models, newer generations produce 5.6× more references and 5.0× more fake references per trace. After agentic filtering of ResearchMath-Reasoning, fine-tuning Qwen3 models from 4B to 30B parameters improves over base models by 9.2 points on average. This shows that filtered open-problem attempts can provide useful supervision even without fully correct reasoning traces. We make ResearchMath-14k publicly available for future work on research-level mathematical reasoning.