rStar2-Agent: Agentic Reasoning Technical Report

Abstract

We introduce rStar2-Agent, a 14B math reasoning model trained with agentic reinforcement learning to achieve frontier-level performance. Beyond current long Chain-of-Thought (CoT) reasoning, the model demonstrates advanced cognitive behaviors, such as thinking carefully before using Python coding tools and reflecting on code execution feedback to autonomously explore, verify, and refine intermediate steps in complex problem-solving. This capability is enabled by three key innovations that make agentic RL effective at scale: (i) an efficient RL infrastructure with a reliable Python code environment that supports high-throughput execution and mitigates high rollout costs, enabling training on limited GPU resources (64 MI300X GPUs); (ii) GRPO-RoC, an agentic RL algorithm with a Resample-on-Correct rollout strategy that addresses the inherent environment noise from coding tools, allowing the model to reason more effectively in a code environment; (iii) an efficient agent training recipe that starts with non-reasoning SFT and progresses through multiple RL stages, yielding advanced cognitive abilities with minimal compute cost. As a result, rStar2-Agent boosts a pre-trained 14B model to state-of-the-art performance in only 510 RL steps within one week, achieving average pass@1 scores of 80.6% on AIME24 and 69.8% on AIME25, surpassing DeepSeek-R1 (671B) with significantly shorter responses. Beyond mathematics, rStar2-Agent-14B also demonstrates strong generalization to alignment, scientific reasoning, and agentic tool-use tasks. Code and training recipes are available at https://github.com/microsoft/rStar.
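
The headline results are reported as average pass@1. The abstract does not spell out the evaluation protocol, so the following is only a minimal sketch of how such a score is commonly computed: sample several responses per problem, take the fraction that are correct for each problem, then average over problems. The function name `average_pass_at_1`, the input layout, and the toy data are hypothetical and are not taken from the rStar repository.

```python
from statistics import mean


def average_pass_at_1(results):
    """Average pass@1 over a benchmark.

    `results` maps a problem id to a list of booleans, one per sampled
    response (True means the final answer was judged correct).
    This layout is an assumption made for illustration only.
    """
    per_problem = []
    for problem_id, flags in results.items():
        if not flags:
            raise ValueError(f"no samples recorded for problem {problem_id!r}")
        # Fraction of correct samples estimates pass@1 for this problem.
        per_problem.append(sum(flags) / len(flags))
    # Unweighted mean across problems gives the benchmark-level score.
    return mean(per_problem)


# Toy, made-up data: three problems with four sampled responses each.
toy_results = {
    "p1": [True, True, False, True],
    "p2": [False, False, False, False],
    "p3": [True, True, True, True],
}
score = average_pass_at_1(toy_results)
print(f"average pass@1 = {score:.2%}")
```

Averaging over several samples per problem reduces the variance of the estimate, which matters on small benchmarks where a single response changes the score by several points.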
