Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction
May 12, 2026
Authors: Zhong Guan, Yongjian Guo, Haoran Sun, Wen Huang, Shuai Di, Xiong Jun Wu, Likang Wu, Hongke Zhao
cs.AI
Abstract
Asynchronous reinforcement learning improves rollout throughput for large language model agents by decoupling sample generation from policy optimization, but it also introduces a critical failure mode for PPO-style off-policy correction. In heterogeneous training systems, the total importance ratio should ideally be decomposed into two semantically distinct factors: a training-inference discrepancy term, which aligns the inference-side and training-side distributions at the same behavior-policy version, and a policy-staleness term, which constrains the update from the historical policy to the current policy. We show that practical asynchronous pipelines with delayed updates and partial rollouts often lose the required historical training-side logits, i.e., the old logits. This missing-old-logits problem entangles discrepancy repair with staleness correction, breaks the intended semantics of decoupled correction, and makes clipping and masking thresholds interact in unintended ways. To address this issue, we study both exact and approximate correction routes. On the exact side, we propose three strategies for acquiring old logits, namely snapshot-based version tracking, a dedicated old-logit model, and synchronization via partial-rollout interruption, and compare their system-level trade-offs. On the approximate side, when exact old logits cannot be recovered at low cost, we focus on preserving the benefits of decoupled correction through a more suitable approximation that incurs no extra system overhead. Following this analysis, we adopt a revised PPO-EWMA method, which achieves significant gains in both training speed and optimization performance. Code is available at https://github.com/millioniron/ROLL.
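
To make the stated decomposition concrete, here is a minimal sketch in our own notation (the paper's exact symbols may differ): let \pi_{\mathrm{inf}} denote the inference-side behavior policy, \pi_{\mathrm{old}} the training-side policy at the same behavior version (whose token logits are the "old logits"), and \pi_\theta the current policy. The per-token importance ratio then factors as

$$
r_t(\theta)
= \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mathrm{inf}}(a_t \mid s_t)}
= \underbrace{\frac{\pi_{\mathrm{old}}(a_t \mid s_t)}{\pi_{\mathrm{inf}}(a_t \mid s_t)}}_{\text{training-inference discrepancy}}
\cdot
\underbrace{\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mathrm{old}}(a_t \mid s_t)}}_{\text{policy staleness}} .
$$

When \log \pi_{\mathrm{old}} is lost, only the total ratio \pi_\theta / \pi_{\mathrm{inf}} remains computable, so the two factors, and hence their separate clipping and masking thresholds, become entangled.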
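
The approximate route can likewise be sketched in a few lines of Python, under the assumption that an exponentially weighted moving average of the policy's parameters stands in for the missing training-side old policy; all names here (EWMAPolicy, decoupled_ratios, decay) are illustrative and are not taken from the ROLL codebase:

import copy
import torch

class EWMAPolicy:
    # Shadow copy of the policy whose parameters track an exponentially
    # weighted moving average of the training policy; its log-probs stand
    # in for the missing training-side old logits.
    def __init__(self, policy: torch.nn.Module, decay: float = 0.99):
        self.decay = decay
        self.shadow = copy.deepcopy(policy).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def step(self, policy: torch.nn.Module) -> None:
        # shadow <- decay * shadow + (1 - decay) * policy,
        # applied after each optimizer step.
        for s, p in zip(self.shadow.parameters(), policy.parameters()):
            s.mul_(self.decay).add_(p.detach(), alpha=1.0 - self.decay)

def decoupled_ratios(logp_new, logp_ewma, logp_inf):
    # Staleness factor (current vs. approximate old policy): the term
    # that PPO-style clipping acts on.
    staleness = torch.exp(logp_new - logp_ewma)
    # Discrepancy factor (training-side old vs. inference-side behavior
    # policy): treated as a fixed, gradient-free importance weight.
    discrepancy = torch.exp(logp_ewma - logp_inf).detach()
    return staleness, discrepancy

The design point this sketch illustrates is that the EWMA shadow is updated locally on the trainer, so approximate old log-probs are available for every token without snapshot storage, a second serving model, or interrupting partial rollouts.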