PivotRL: High Accuracy Agentic Post-Training at Low Compute Cost

Abstract

Post-training for long-horizon agentic tasks involves a tension between compute efficiency and generalization. Supervised fine-tuning (SFT) is compute-efficient but often suffers from out-of-domain (OOD) degradation. Conversely, end-to-end reinforcement learning (E2E RL) preserves OOD capabilities but incurs high compute costs because it requires many turns of on-policy rollout. We introduce PivotRL, a novel framework that operates on existing SFT trajectories to combine the compute efficiency of SFT with the OOD accuracy of E2E RL. PivotRL relies on two key mechanisms. First, it executes local, on-policy rollouts and filters for pivots, which are informative intermediate turns where sampled actions exhibit high variance in outcomes. Second, it rewards functionally equivalent actions rather than demanding strict string matching with the SFT demonstration. We show theoretically that these mechanisms incentivize strong learning signals with high natural gradient norm while maximally preserving the policy's probability ordering on actions unrelated to the training tasks. Compared with standard SFT on identical data, we demonstrate that PivotRL achieves +4.17% higher in-domain accuracy on average across four agentic domains and +10.04% higher OOD accuracy on non-agentic tasks. Notably, on agentic coding tasks, PivotRL achieves accuracy competitive with E2E RL using 4x fewer rollout turns. PivotRL has been adopted by NVIDIA's Nemotron-3-Super-120B-A12B, where it acts as the workhorse of production-scale agentic post-training.
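As an illustration only, the two mechanisms described in the abstract might be sketched as follows. This is not the authors' implementation. The names Turn, functional_reward, find_pivots, and toy_policy are hypothetical, the JSON comparison is an assumed stand-in for whatever functional-equivalence check the paper uses, and the variance threshold is an arbitrary choice.

```python
import json
from dataclasses import dataclass
from statistics import pvariance
from typing import Callable, List, Sequence


@dataclass
class Turn:
    """One intermediate turn of an SFT trajectory (hypothetical structure)."""
    context: str
    demo_action: str  # action recorded in the SFT demonstration


def functional_reward(demo_action: str, sampled_action: str) -> float:
    """Assumed equivalence check: 1.0 if both actions parse to the same
    JSON value (key order and whitespace ignored), else 0.0."""
    try:
        return float(json.loads(demo_action) == json.loads(sampled_action))
    except json.JSONDecodeError:
        return 0.0


def find_pivots(
    turns: Sequence[Turn],
    sample_actions: Callable[[Turn], List[str]],
    min_variance: float = 0.05,  # arbitrary threshold, not from the paper
) -> List[int]:
    """Return indices of turns whose sampled actions give high-variance rewards."""
    pivots = []
    for i, turn in enumerate(turns):
        # Local rollout: sample several actions from the current policy at this turn.
        rewards = [functional_reward(turn.demo_action, a) for a in sample_actions(turn)]
        # Keep the turn only if outcomes disagree enough to carry learning signal.
        if pvariance(rewards) >= min_variance:
            pivots.append(i)
    return pivots


def toy_policy(turn: Turn) -> List[str]:
    """Deterministic stand-in for on-policy sampling (illustration only)."""
    if turn.context == "easy":
        # Always functionally correct, though the string differs from the demo.
        return ['{"args": {}, "tool": "ls"}'] * 4
    # Right half the time, wrong half the time.
    good = '{"tool": "grep", "args": {"pattern": "x"}}'
    bad = '{"tool": "cat", "args": {}}'
    return [good, bad, good, bad]


toy_turns = [
    Turn("easy", '{"tool": "ls", "args": {}}'),
    Turn("hard", '{"tool": "grep", "args": {"pattern": "x"}}'),
]
pivot_indices = find_pivots(toy_turns, toy_policy)
```

In this toy setup the "easy" turn is dropped because every sampled action earns the same reward, while the "hard" turn is kept as a pivot because its rewards are split. The "easy" samples also show why a functional check matters: they differ from the demonstration as strings but would still be rewarded.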
