PivotRL: High Accuracy Agentic Post-Training at Low Compute Cost

Abstract

Post-training for long-horizon agentic tasks faces a tension between compute efficiency and generalization. Supervised fine-tuning (SFT) is compute-efficient, but it often suffers from out-of-domain (OOD) degradation. Conversely, end-to-end reinforcement learning (E2E RL) preserves OOD capabilities, but incurs high compute costs because it requires many turns of on-policy rollout. We introduce PivotRL, a novel framework that operates on existing SFT trajectories to combine the compute efficiency of SFT with the OOD accuracy of E2E RL. PivotRL relies on two key mechanisms. First, it executes local, on-policy rollouts and filters for pivots: informative intermediate turns where sampled actions exhibit high variance in outcomes. Second, it rewards functionally equivalent actions rather than demanding strict string matching with the SFT demonstration. We show theoretically that these mechanisms incentivize strong learning signals with high natural gradient norm, while maximally preserving the policy's probability ordering on actions unrelated to the training tasks. Compared with standard SFT on identical data, PivotRL achieves +4.17% higher in-domain accuracy on average across four agentic domains, and +10.04% higher OOD accuracy on non-agentic tasks. Notably, on agentic coding tasks, PivotRL achieves accuracy competitive with E2E RL while using 4x fewer rollout turns. PivotRL has been adopted for NVIDIA's Nemotron-3-Super-120B-A12B, where it acts as the workhorse of production-scale agentic post-training.
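The two mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the whitespace-only equivalence check, the binary reward, the variance threshold `min_variance`, and all function names are assumptions made here for illustration. The abstract states only that pivots are turns with high outcome variance and that rewards accept functionally equivalent actions.

```python
from statistics import pvariance
from typing import Sequence


def normalize(action: str) -> str:
    """Hypothetical equivalence key: collapse whitespace differences only."""
    return " ".join(action.split())


def functional_reward(sampled: str, demo: str) -> float:
    """Assumed binary reward: 1.0 if the sampled action is equivalent to the
    demonstrated action under `normalize`, else 0.0."""
    return 1.0 if normalize(sampled) == normalize(demo) else 0.0


def select_pivots(rewards_per_turn: Sequence[Sequence[float]],
                  min_variance: float = 0.05) -> list[int]:
    """Return indices of turns whose sampled-action rewards vary enough.

    A turn where every sample succeeds (or every sample fails) has zero
    variance and is filtered out in this sketch.
    """
    return [
        turn
        for turn, rewards in enumerate(rewards_per_turn)
        if len(rewards) > 0 and pvariance(rewards) >= min_variance
    ]


# Toy data: one demonstrated action per turn, four sampled actions per turn.
demo_actions = ["ls -la /tmp", "cat README.md", "pytest -q"]
samples = [
    ["ls -la /tmp", "ls  -la /tmp", "ls -la  /tmp", "ls -la /tmp"],    # all match
    ["cat README.md", "cat setup.py", "cat  README.md", "less x.md"],  # mixed
    ["make test", "tox", "nosetests", "python -m unittest"],           # none match
]

rewards = [
    [functional_reward(s, demo) for s in turn_samples]
    for demo, turn_samples in zip(demo_actions, samples)
]
pivots = select_pivots(rewards)
print(pivots)  # [1]
```

In this toy example only the middle turn is kept, because it is the only turn where the sampled actions disagree in outcome. A real system would need a task-specific notion of functional equivalence in place of whitespace normalization.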
