Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

Abstract

Asynchronous reinforcement learning improves rollout throughput for large language model agents by decoupling sample generation from policy optimization, but it also introduces a critical failure mode for PPO-style off-policy correction. In heterogeneous training systems, the total importance ratio should ideally decompose into two semantically distinct factors: a training-inference discrepancy term, which aligns the inference-side and training-side distributions at the same behavior-policy version, and a policy-staleness term, which constrains the update from the historical policy to the current policy. We show that practical asynchronous pipelines with delayed updates and partial rollouts often lose the historical training-side logits (the "old logits") that this decomposition requires. This missing-old-logit problem entangles discrepancy repair with staleness correction, breaks the intended semantics of decoupled correction, and causes the clipping and masking thresholds to interact undesirably. To address it, we study both exact and approximate correction routes. For exact correction, we propose three old-logit acquisition strategies (snapshot-based version tracking, a dedicated old-logit model, and synchronization via partial rollout interruption) and compare their system trade-offs. For approximate correction, we focus on the case where exact old logits cannot be recovered at low cost, and aim to preserve the benefits of decoupled correction by using a more appropriate approximate policy that adds no extra system overhead. Following this analysis, we adopt a revised PPO-EWMA method, which achieves significant gains in both training speed and optimization performance. Code is available at https://github.com/millioniron/ROLL.
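
The decomposition described in the abstract can be illustrated with a minimal sketch. It is not taken from the ROLL code base: the function names, the clipping width `eps`, and the mask bounds `low` and `high` are hypothetical choices made for illustration, and the inputs are assumed to be per-token log-probabilities of the sampled tokens. The sketch shows that the two factors multiply to the total ratio, and that substituting the current log-probabilities for missing old ones collapses the staleness term to 1 and pushes all of the staleness into the discrepancy term.

```python
import math


def decoupled_ratios(logp_infer, logp_old_train, logp_current):
    """Split the total per-token importance ratio into two factors.

    logp_infer:     log-probs from the inference engine that sampled the tokens
    logp_old_train: log-probs from the training engine at the same (old) policy version
    logp_current:   log-probs from the training engine at the current policy version
    """
    # Training-inference discrepancy: same policy version, different engines.
    discrepancy = [math.exp(o - i) for o, i in zip(logp_old_train, logp_infer)]
    # Policy staleness: same engine, old version versus current version.
    staleness = [math.exp(c - o) for c, o in zip(logp_current, logp_old_train)]
    # Total ratio, equal to discrepancy * staleness token by token.
    total = [math.exp(c - i) for c, i in zip(logp_current, logp_infer)]
    return discrepancy, staleness, total


def clip_staleness(staleness, eps=0.2):
    """PPO-style clipping applied to the staleness factor (eps is illustrative)."""
    return [min(max(r, 1.0 - eps), 1.0 + eps) for r in staleness]


def mask_discrepancy(discrepancy, low=0.5, high=2.0):
    """Keep only tokens whose discrepancy factor lies in [low, high] (illustrative bounds)."""
    return [low <= r <= high for r in discrepancy]


# Made-up log-probabilities for three sampled tokens.
logp_infer = [-1.0, -2.0, -0.5]
logp_old_train = [-1.1, -1.9, -0.5]
logp_current = [-0.8, -2.4, -0.5]

# Old logits available: the two factors keep their separate meanings.
disc, stale, total = decoupled_ratios(logp_infer, logp_old_train, logp_current)

# Old logits missing, current log-probs used as a stand-in:
# the staleness factor becomes 1 and the "discrepancy" factor equals the total ratio.
disc_missing, stale_missing, _ = decoupled_ratios(logp_infer, logp_current, logp_current)
```

In the second call the mask bounds intended for engine mismatch end up being applied to a quantity that also contains the policy update, while the clip on the staleness factor never activates. This is one way to read the undesirable threshold interaction that the abstract describes.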

