
PivotRL: High Accuracy Agentic Post-Training at Low Compute Cost

March 22, 2026
作者: Junkeun Yi, Damon Mosk-Aoyama, Baihe Huang, Ritu Gala, Charles Wang, Sugam Dipak Devare, Khushi Bhardwaj, Abhibha Gupta, Oleksii Kuchaiev, Jiantao Jiao, Jian Zhang, Venkat Srinivasan
cs.AI

Abstract

Post-training for long-horizon agentic tasks exhibits a tension between compute efficiency and generalization. While supervised fine-tuning (SFT) is compute-efficient, it often suffers from out-of-domain (OOD) degradation. Conversely, end-to-end reinforcement learning (E2E RL) preserves OOD capabilities but incurs high compute costs due to many turns of on-policy rollout. We introduce PivotRL, a novel framework that operates on existing SFT trajectories to combine the compute efficiency of SFT with the OOD accuracy of E2E RL. PivotRL relies on two key mechanisms: first, it executes local, on-policy rollouts and filters for pivots, informative intermediate turns where sampled actions exhibit high variance in outcomes; second, it rewards functionally equivalent actions rather than demanding strict string matches against the SFT demonstrations. We show theoretically that these mechanisms incentivize strong learning signals with high natural-gradient norm while maximally preserving the policy's probability ordering on actions unrelated to the training tasks. Compared to standard SFT on identical data, PivotRL achieves +4.17% higher in-domain accuracy on average across four agentic domains and +10.04% higher OOD accuracy on non-agentic tasks. Notably, on agentic coding tasks, PivotRL matches the accuracy of E2E RL with 4x fewer rollout turns. PivotRL is adopted by NVIDIA's Nemotron-3-Super-120B-A12B, serving as the workhorse for production-scale agentic post-training.
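The first mechanism, filtering for pivots, can be illustrated with a minimal sketch. Everything here is hypothetical and not from the paper: the function names (`select_pivots`, `sample_actions`, `rollout_outcome`), the sample count `k`, and the variance threshold are illustrative stand-ins for whatever criterion the authors actually use to decide that a turn's sampled outcomes are "high variance".

```python
import statistics

def select_pivots(trajectory_turns, sample_actions, rollout_outcome,
                  k=8, var_threshold=0.2):
    """Hypothetical sketch of pivot filtering: at each intermediate
    turn of an SFT trajectory, sample k candidate actions from the
    current policy, roll each out locally to a scalar outcome (e.g. a
    task-success reward in [0, 1]), and keep only the turns whose
    outcomes have high variance -- the informative "pivots"."""
    pivots = []
    for turn in trajectory_turns:
        actions = sample_actions(turn, k)                   # local, on-policy samples
        outcomes = [rollout_outcome(turn, a) for a in actions]
        if statistics.pvariance(outcomes) > var_threshold:  # high-variance turn
            pivots.append((turn, actions, outcomes))
    return pivots
```

Turns where every sampled action succeeds (or every one fails) carry no contrast and are skipped; training concentrates on the turns where the choice of action actually changes the outcome, which is where the abstract says the strong learning signal lives.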