ResearchMath-14K: Scaling Research-Level Mathematics via Agents

Abstract

The frontier of mathematics is defined by problems whose solutions are not yet known, yet it remains unclear whether language models can meaningfully engage with such problems without human intervention. A major obstacle is the lack of large-scale, research-level mathematics datasets. To address this gap, we introduce ResearchMath-14k, a set of 14,056 problems curated from academic sources via a multi-agent pipeline, which is the largest collection of research-level mathematical problems to date. We further generate ResearchMath-Reasoning, a set of 220K teacher trajectories from two open models, in which we observe recurring avoidance behaviors such as non-attempts and fabricated references. Interestingly, across eight open-weight models, newer generations produce 5.6 times more references and 5.0 times more fake references per trace. After agentic filtering of ResearchMath-Reasoning, fine-tuning Qwen3 models ranging from 4B to 30B parameters improves over the base models by 9.2 points on average. This shows that filtered attempts at open problems can provide useful supervision even without fully correct reasoning traces. We make ResearchMath-14k publicly available for future work on research-level mathematical reasoning.