rStar2-Agent: Agentic Reasoning Technical Report
August 28, 2025
作者: Ning Shang, Yifei Liu, Yi Zhu, Li Lyna Zhang, Weijiang Xu, Xinyu Guan, Buze Zhang, Bingcheng Dong, Xudong Zhou, Bowen Zhang, Ying Xin, Ziming Miao, Scarlett Li, Fan Yang, Mao Yang
cs.AI
Abstract
We introduce rStar2-Agent, a 14B math reasoning model trained with agentic
reinforcement learning to achieve frontier-level performance. Beyond current
long CoT, the model demonstrates advanced cognitive behaviors, such as thinking
carefully before using Python coding tools and reflecting on code execution
feedback to autonomously explore, verify, and refine intermediate steps in
complex problem-solving. This capability is enabled through three key
innovations that make agentic RL effective at scale: (i) an efficient RL
infrastructure with a reliable Python code environment that supports
high-throughput execution and mitigates the high rollout costs, enabling
training on limited GPU resources (64 MI300X GPUs); (ii) GRPO-RoC, an agentic
RL algorithm with a Resample-on-Correct rollout strategy that addresses the
inherent environment noise from coding tools, allowing the model to reason
more effectively in a code environment; (iii) an efficient agent training
recipe that starts with non-reasoning SFT and progresses through multiple RL
stages, yielding advanced cognitive abilities at minimal compute cost. As a
result, rStar2-Agent boosts a pre-trained 14B model to the state of the art in
only 510 RL steps within one week, achieving average pass@1 scores of 80.6% on
AIME24 and 69.8% on AIME25, surpassing DeepSeek-R1 (671B) with significantly
shorter responses. Beyond mathematics, rStar2-Agent-14B also demonstrates
strong generalization to alignment, scientific reasoning, and agentic tool-use
tasks. Code and training recipes are available at
https://github.com/microsoft/rStar.
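
The Resample-on-Correct idea in GRPO-RoC can be illustrated with a minimal sketch. This is not the authors' implementation: it only assumes that each GRPO group of rollouts is oversampled, that correct rollouts are preferentially kept when their trajectories contain fewer tool-execution errors (so the positive learning signal is less contaminated by environment noise), and that incorrect rollouts are sampled uniformly to preserve diverse failure modes. The `Rollout` fields and `resample_on_correct` helper are hypothetical names introduced for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Rollout:
    answer_correct: bool   # did the trajectory reach the right final answer?
    tool_errors: int       # e.g. Python tracebacks seen during tool calls
    reward: float


def resample_on_correct(rollouts, group_size, rng=None):
    """Downsample an oversampled group of rollouts to `group_size`.

    Correct rollouts are ranked by tool-execution errors and the cleanest
    ones are kept; incorrect rollouts are drawn uniformly at random.
    """
    rng = rng or random.Random(0)
    pos = sorted((r for r in rollouts if r.answer_correct),
                 key=lambda r: r.tool_errors)
    neg = [r for r in rollouts if not r.answer_correct]
    n_pos = min(len(pos), group_size // 2)
    keep = pos[:n_pos]  # cleanest correct trajectories first
    keep += rng.sample(neg, min(len(neg), group_size - n_pos))
    return keep
```

The retained group would then feed the usual GRPO advantage computation; the filtering only changes which trajectories enter the group, not the policy-gradient update itself.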