ResearchMath-14K: Scaling Research-Level Mathematics with Agents

Abstract

The frontier of mathematics is defined by problems whose solutions are not yet known, yet it remains unclear whether language models can meaningfully engage with such problems without human intervention. A major obstacle is the lack of large-scale research-level math datasets. To this end, we introduce ResearchMath-14k, a set of 14,056 problems curated from academic sources via a multi-agent pipeline, making it the largest collection of research-level mathematical problems to date. We further generate ResearchMath-Reasoning, 220K teacher trajectories from two open models, where we observe recurring avoidance behaviors such as non-attempts and fabricated references. Interestingly, across eight open-weight models, newer generations produce 5.6× more references and 5.0× more fake references per trace. After agentic filtering of ResearchMath-Reasoning, fine-tuning Qwen3 models from 4B to 30B parameters improves over base models by 9.2 points on average. This shows that filtered open-problem attempts can provide useful supervision even without fully correct reasoning traces. We make ResearchMath-14k publicly available for future work on research-level mathematical reasoning.