rStar2-Agent: Agentic Reasoning Technical Report
August 28, 2025
Authors: Ning Shang, Yifei Liu, Yi Zhu, Li Lyna Zhang, Weijiang Xu, Xinyu Guan, Buze Zhang, Bingcheng Dong, Xudong Zhou, Bowen Zhang, Ying Xin, Ziming Miao, Scarlett Li, Fan Yang, Mao Yang
cs.AI
Abstract
We introduce rStar2-Agent, a 14B math reasoning model trained with agentic
reinforcement learning to achieve frontier-level performance. Beyond current
long CoT, the model demonstrates advanced cognitive behaviors, such as thinking
carefully before using Python coding tools and reflecting on code execution
feedback to autonomously explore, verify, and refine intermediate steps in
complex problem-solving. This capability is enabled through three key
innovations that make agentic RL effective at scale: (i) an efficient RL
infrastructure with a reliable Python code environment that supports
high-throughput execution and mitigates the high rollout costs, enabling
training on limited GPU resources (64 MI300X GPUs); (ii) GRPO-RoC, an agentic
RL algorithm with a Resample-on-Correct rollout strategy that addresses the
inherent environment noise from coding tools, allowing the model to reason
more effectively in a code environment; (iii) an efficient agent training
recipe that starts with non-reasoning SFT and progresses through multiple
RL stages, yielding advanced cognitive abilities at minimal compute cost.
Building on this, rStar2-Agent boosts a pre-trained 14B model to
state-of-the-art performance in
only 510 RL steps within one week, achieving average pass@1 scores of 80.6% on
AIME24 and 69.8% on AIME25, surpassing DeepSeek-R1 (671B) with significantly
shorter responses. Beyond mathematics, rStar2-Agent-14B also demonstrates
strong generalization to alignment, scientific reasoning, and agentic tool-use
tasks. Code and training recipes are available at
https://github.com/microsoft/rStar.
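As a rough illustration of the Resample-on-Correct rollout strategy mentioned in (ii), the sketch below oversamples rollouts and preferentially keeps correct traces with the fewest failed tool calls, filtering out environment noise from the coding tool. The rollout schema (`correct`, `tool_errors`), function name, and selection details are hypothetical assumptions for exposition, not the paper's implementation:

```python
import random

def resample_on_correct(rollouts, group_size):
    """Select a training group from an oversampled pool of rollouts.

    Hypothetical schema: each rollout is a dict with 'correct' (bool,
    did the final answer verify) and 'tool_errors' (count of failed
    Python tool executions along the trace).

    Correct rollouts are kept preferentially, and among them those
    with the fewest tool-call errors come first, so the group used
    for the policy update is biased toward clean, low-noise traces.
    """
    # Rank correct rollouts by how cleanly they used the coding tool.
    correct = sorted((r for r in rollouts if r["correct"]),
                     key=lambda r: r["tool_errors"])
    incorrect = [r for r in rollouts if not r["correct"]]

    group = correct[:group_size]
    # If too few rollouts are correct, pad with incorrect ones so the
    # group still carries negative signal for the advantage estimate.
    if len(group) < group_size:
        group += random.sample(incorrect, group_size - len(group))
    return group
```

Keeping the cleanest correct traces means the positive reward signal is less likely to reinforce spurious tool errors that happened to precede a correct answer.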