Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

Abstract

Asynchronous reinforcement learning improves rollout throughput for large language model agents by decoupling sample generation from policy optimization, but it also introduces a critical failure mode for PPO-style off-policy correction. In heterogeneous training systems, the total importance ratio should ideally be decomposed into two semantically distinct factors: a training-inference discrepancy term, which aligns the inference-side and training-side distributions at the same behavior-policy version, and a policy-staleness term, which constrains the update from the historical policy to the current policy. We show that practical asynchronous pipelines with delayed updates and partial rollouts often lose the historical training-side logits ("old logits") that this decomposition requires. This missing-old-logit problem entangles discrepancy repair with staleness correction, breaks the intended semantics of decoupled correction, and causes the clipping and masking thresholds to interact in undesirable ways. To address the problem, we study both exact and approximate correction routes. For exact correction, we propose three old-logit acquisition strategies (snapshot-based version tracking, a dedicated old-logit model, and synchronization via partial rollout interruption) and compare their system trade-offs. For approximate correction, we focus on the case where exact old logits cannot be recovered at low cost, and aim to preserve the benefits of decoupled correction through a more appropriate approximate policy without incurring extra system overhead. Following this analysis, we adopt a revised PPO-EWMA method, which achieves significant gains in both training speed and optimization performance. Code is available at https://github.com/millioniron/ROLL.
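The two-factor decomposition described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the function names, the truncation cap on the discrepancy term, and the clipping range are hypothetical choices made only to show how the total importance ratio splits into a training-inference discrepancy factor and a policy-staleness factor, and what happens when the old training-side log-probability is unavailable and the inference-side value is substituted for it.

```python
import math

def decoupled_ratios(logp_current, logp_old_train, logp_infer):
    """Split the total importance ratio into two factors (per token).

    logp_current   : log-prob under the current policy (training side)
    logp_old_train : log-prob under the behavior-policy version, recomputed
                     on the training side (the "old logit" quantity)
    logp_infer     : log-prob reported by the inference engine at sampling time
    """
    # Training-inference discrepancy: same policy version, different backends.
    mismatch = math.exp(logp_old_train - logp_infer)
    # Policy staleness: historical policy -> current policy.
    staleness = math.exp(logp_current - logp_old_train)
    return mismatch, staleness

def decoupled_token_objective(logp_current, logp_old_train, logp_infer,
                              advantage, clip_eps=0.2, mismatch_cap=2.0):
    """Illustrative per-token surrogate; thresholds are assumed values."""
    mismatch, staleness = decoupled_ratios(logp_current, logp_old_train, logp_infer)
    # Assumed treatment: truncate the discrepancy weight at a fixed cap.
    weight = min(mismatch, mismatch_cap)
    # PPO-style clipping applied only to the staleness factor.
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, staleness))
    surrogate = min(staleness * advantage, clipped * advantage)
    return weight * surrogate

# With old logits available, the two factors multiply to the total ratio.
m, s = decoupled_ratios(logp_current=-1.0, logp_old_train=-1.2, logp_infer=-1.1)
total = math.exp(-1.0 - (-1.1))

# With old logits missing, substituting the inference-side value collapses the
# discrepancy factor to 1 and pushes the whole ratio into the clipped term.
m_missing, s_missing = decoupled_ratios(logp_current=-1.0,
                                        logp_old_train=-1.1,  # fallback: inference value
                                        logp_infer=-1.1)
```

In the fallback case the clipping threshold ends up acting on a quantity that mixes backend mismatch with policy drift, which is one way to read the entanglement the abstract describes.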
