rStar2-Agent: Agentic Reasoning Technical Report

Abstract

We introduce rStar2-Agent, a 14B math reasoning model trained with agentic reinforcement learning (RL) to achieve frontier-level performance. Beyond current long chain-of-thought (CoT) reasoning, the model demonstrates advanced cognitive behaviors, such as thinking carefully before using Python coding tools and reflecting on code execution feedback to autonomously explore, verify, and refine intermediate steps in complex problem-solving. This capability is enabled by three key innovations that make agentic RL effective at scale: (i) an efficient RL infrastructure with a reliable Python code environment that supports high-throughput execution and mitigates high rollout costs, enabling training on limited GPU resources (64 MI300X GPUs); (ii) GRPO-RoC, an agentic RL algorithm with a Resample-on-Correct rollout strategy that addresses the inherent environment noise from coding tools, allowing the model to reason more effectively in a code environment; and (iii) an efficient agent training recipe that starts with non-reasoning SFT and progresses through multiple RL stages, yielding advanced cognitive abilities at minimal compute cost. As a result, rStar2-Agent boosts a pre-trained 14B model to state-of-the-art performance in only 510 RL steps within one week, achieving average pass@1 scores of 80.6% on AIME24 and 69.8% on AIME25 and surpassing DeepSeek-R1 (671B) with significantly shorter responses. Beyond mathematics, rStar2-Agent-14B also demonstrates strong generalization to alignment, scientific reasoning, and agentic tool-use tasks. Code and training recipes are available at https://github.com/microsoft/rStar.
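The abstract reports results as average pass@1 but does not define the metric. As a minimal sketch, assuming the common convention that each problem is sampled several times and the per-problem fraction of correct responses is then averaged over the benchmark, the computation might look like the following. The function name `average_pass_at_1`, the problem IDs, and the toy outcomes are hypothetical illustrations, not data or code from the report.

```python
from statistics import mean


def average_pass_at_1(results: dict[str, list[bool]]) -> float:
    """Mean over problems of the fraction of sampled responses that are correct.

    `results` maps a problem ID to one boolean per sampled response
    (True if that response's final answer was judged correct).
    """
    if not results:
        raise ValueError("no problems given")
    per_problem = []
    for problem_id, outcomes in results.items():
        if not outcomes:
            raise ValueError(f"no samples for {problem_id}")
        # Fraction of correct samples estimates pass@1 for this problem.
        per_problem.append(sum(outcomes) / len(outcomes))
    return mean(per_problem)


# Hypothetical toy data: three problems, four sampled responses each.
toy = {
    "p1": [True, True, False, True],
    "p2": [False, False, False, False],
    "p3": [True, True, True, True],
}
score = average_pass_at_1(toy)
print(f"{score:.1%}")  # prints "58.3%"
```

Averaging over several samples per problem reduces the variance that a single sampled response would introduce on a small benchmark; the number of samples per problem is not stated in the abstract.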
