ResearchMath-14K: Scaling Research-Level Mathematics via Agents

Abstract

The frontier of mathematics is defined by problems whose solutions are not yet known, yet it remains unclear whether language models can meaningfully engage with such problems without human intervention. A major obstacle is the lack of large-scale research-level mathematics datasets. To address this, we introduce ResearchMath-14k, a dataset of 14,056 problems curated from academic sources via a multi-agent pipeline, making it the largest collection of research-level mathematical problems to date. We further generate ResearchMath-Reasoning, a set of 220K teacher trajectories from two open models, in which we observe recurring avoidance behaviors such as non-attempts and fabricated references. Interestingly, across eight open-weight models, newer generations produce 5.6× more references and 5.0× more fabricated references per trace. After agentic filtering of ResearchMath-Reasoning, fine-tuning Qwen3 models ranging from 4B to 30B parameters improves over the base models by 9.2 percentage points on average. This shows that filtered attempts at open problems can provide useful supervision even without fully correct reasoning traces. We make ResearchMath-14k publicly available for future work on research-level mathematical reasoning.