

Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

May 12, 2026
Authors: Zhong Guan, Yongjian Guo, Haoran Sun, Wen Huang, Shuai Di, Xiong Jun Wu, Likang Wu, Hongke Zhao
cs.AI

Abstract

Asynchronous reinforcement learning improves rollout throughput for large language model agents by decoupling sample generation from policy optimization, but it also introduces a critical failure mode for PPO-style off-policy correction. In heterogeneous training systems, the total importance ratio should ideally be decomposed into two semantically distinct factors: a training-inference discrepancy term that aligns inference-side and training-side distributions at the same behavior-policy version, and a policy-staleness term that constrains the update from the historical policy to the current policy. We show that practical asynchronous pipelines with delayed updates and partial rollouts often lose the required historical training-side logits, or old logits. This missing-old-logit problem entangles discrepancy repair with staleness correction, breaks the intended semantics of decoupled correction, and makes clipping and masking thresholds interact in undesirable ways. To address this issue, we study both exact and approximate correction routes. We propose three exact old-logit acquisition strategies (snapshot-based version tracking, a dedicated old-logit model, and synchronization via partial rollout interruption) and compare their system trade-offs. On the approximate side, we focus on preserving the benefits of decoupled correction, without extra system overhead, by using a more suitable approximate policy when exact old logits cannot be recovered cheaply. Following this analysis, we adopt a revised PPO-EWMA method, which achieves significant gains in both training speed and optimization performance. Code: https://github.com/millioniron/ROLL
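For concreteness, the decomposition sketched in the abstract can be written out; the notation below is ours, since the abstract does not fix symbols. Writing $\mu_{\text{infer}}$ for the inference-engine realization of the behavior policy and $\pi^{\text{train}}_{\text{old}}$ for the training-side policy at the same version, the total importance ratio for a token $a_t$ in state $s_t$ factors as

$$
\frac{\pi_\theta(a_t \mid s_t)}{\mu_{\text{infer}}(a_t \mid s_t)}
= \underbrace{\frac{\pi^{\text{train}}_{\text{old}}(a_t \mid s_t)}{\mu_{\text{infer}}(a_t \mid s_t)}}_{\text{training-inference discrepancy}}
\times
\underbrace{\frac{\pi_\theta(a_t \mid s_t)}{\pi^{\text{train}}_{\text{old}}(a_t \mid s_t)}}_{\text{policy staleness}} .
$$

Since $\pi^{\text{train}}_{\text{old}}(a_t \mid s_t)$ appears in both factors, losing those logits (for example, after delayed updates or partial rollouts) leaves neither factor computable on its own, which is exactly the entanglement the abstract describes.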
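As a rough illustration of the approximate route, the sketch below maintains an exponential moving average of the policy parameters and uses its log-probabilities as proxy old logits, in the spirit of PPO-EWMA. This is a minimal sketch under our own assumptions: the function names, signatures, and decay constant are ours and are not taken from the ROLL codebase.

```python
import torch


@torch.no_grad()
def update_ewma_policy(ewma_model, policy_model, beta=0.99):
    """EWMA update of a frozen copy of the policy (hypothetical sketch).

    The EWMA copy's logits stand in for the missing training-side
    old logits when the exact historical version is no longer available.
    """
    for p_ewma, p in zip(ewma_model.parameters(), policy_model.parameters()):
        p_ewma.mul_(beta).add_(p, alpha=1.0 - beta)


def decoupled_ratios(logp_current, logp_ewma, logp_infer):
    """Split the total importance ratio into its two factors.

    Args:
        logp_current: log-probs of sampled tokens under the current policy.
        logp_ewma:    log-probs under the EWMA proxy for the old policy.
        logp_infer:   log-probs recorded by the inference engine at rollout.
    """
    # Training-inference discrepancy: proxy old policy vs. inference engine.
    r_discrepancy = torch.exp(logp_ewma - logp_infer)
    # Policy staleness: current policy vs. proxy old policy.
    r_staleness = torch.exp(logp_current - logp_ewma)
    return r_discrepancy, r_staleness
```

Because the EWMA copy lags the live policy smoothly, each factor can still be clipped or masked against its own threshold, which is the decoupled-correction behavior the exact old logits were meant to enable.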