PivotRL: High Accuracy Agentic Post-Training at Low Compute Cost
March 22, 2026
Authors: Junkeun Yi, Damon Mosk-Aoyama, Baihe Huang, Ritu Gala, Charles Wang, Sugam Dipak Devare, Khushi Bhardwaj, Abhibha Gupta, Oleksii Kuchaiev, Jiantao Jiao, Jian Zhang, Venkat Srinivasan
cs.AI
Abstract
Post-training for long-horizon agentic tasks exhibits a tension between compute efficiency and generalization. While supervised fine-tuning (SFT) is compute efficient, it often suffers from out-of-domain (OOD) degradation. Conversely, end-to-end reinforcement learning (E2E RL) preserves OOD capabilities, but incurs high compute costs due to many turns of on-policy rollout. We introduce PivotRL, a novel framework that operates on existing SFT trajectories to combine the compute efficiency of SFT with the OOD accuracy of E2E RL. PivotRL relies on two key mechanisms: first, it executes local, on-policy rollouts and filters for pivots, informative intermediate turns where sampled actions exhibit high variance in outcomes; second, it rewards functionally equivalent actions rather than demanding strict string matching with the SFT demonstrations. We theoretically show that these mechanisms incentivize strong learning signals with high natural gradient norm, while maximally preserving policy probability ordering on actions unrelated to training tasks. Compared to standard SFT on identical data, we demonstrate that PivotRL achieves +4.17% higher in-domain accuracy on average across four agentic domains, and +10.04% higher OOD accuracy on non-agentic tasks. Notably, on agentic coding tasks, PivotRL achieves accuracy competitive with E2E RL using 4x fewer rollout turns. PivotRL has been adopted by NVIDIA's Nemotron-3-Super-120B-A12B, acting as the workhorse in production-scale agentic post-training.
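The pivot-filtering mechanism described above can be illustrated with a minimal sketch. Assuming each intermediate turn has been locally rolled out several times and scored with a binary task outcome, pivots are the turns whose sampled outcomes vary widely; all function names and the variance threshold below are hypothetical, not taken from the paper.

```python
import statistics

def find_pivots(turn_outcomes, var_threshold=0.2):
    """Select intermediate turns whose sampled-action outcomes vary widely.

    turn_outcomes: list where entry t holds the scalar outcomes (e.g. 0/1
    task success) of several actions sampled at turn t of a trajectory.
    Returns indices of 'pivot' turns: those with outcome variance above
    the threshold, i.e. turns where the choice of action is informative.
    (Hypothetical sketch; the paper's actual criterion may differ.)
    """
    pivots = []
    for t, outcomes in enumerate(turn_outcomes):
        if len(outcomes) > 1 and statistics.pvariance(outcomes) > var_threshold:
            pivots.append(t)
    return pivots

# Turn 0: every rollout succeeds (no learning signal).
# Turn 1: mixed outcomes -> high variance -> pivot.
# Turn 2: every rollout fails (no learning signal).
outcomes = [[1, 1, 1, 1], [1, 0, 1, 0], [0, 0, 0, 0]]
print(find_pivots(outcomes))  # → [1]
```

Turns where all sampled actions succeed or all fail carry no gradient signal, so restricting training to high-variance turns concentrates compute where the policy's choice actually changes the outcome.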