Prosperity before Collapse: How Far Can Off-Policy RL Reach with Stale Data on LLMs?

Abstract

Reinforcement learning (RL) has been central to recent advances in large language model (LLM) reasoning, but most algorithms rely on on-policy training, which demands fresh rollouts at every update and thus limits efficiency and scalability. Asynchronous RL systems alleviate this by decoupling rollout generation from training, yet their effectiveness hinges on tolerating large staleness in rollout data, a setting in which existing methods either degrade in performance or collapse. We revisit this challenge and uncover a prosperity-before-collapse phenomenon: stale data can be as informative as on-policy data if exploited properly. Building on this insight, we introduce M2PO (Second-Moment Trust Policy Optimization), which constrains the second moment of importance weights to suppress only extreme outliers while preserving informative updates. Notably, M2PO sharply reduces the fraction of clipped tokens under high staleness (from 1.22% to 0.06% over training), precisely masking high-variance tokens while maintaining stable optimization. Extensive evaluation across six models (from 1.7B to 32B parameters) and eight benchmarks shows that M2PO delivers stable off-policy training even with data stale by at least 256 model updates, matching on-policy performance.
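
The abstract does not give the exact form of the second-moment constraint, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the second moment is the mean squared log importance ratio over the tokens that are kept, and that tokens with the largest squared log-ratio are masked first until that mean falls below a threshold. The function name `second_moment_mask` and the threshold value are hypothetical.

```python
def second_moment_mask(log_ratios, threshold):
    """Return a keep-mask (list of bool) over tokens.

    Assumption (not from the source): the constrained quantity is the mean
    of (log importance ratio)^2 over kept tokens. Tokens are dropped in
    order of decreasing squared log-ratio until that mean is <= threshold,
    so only the most extreme outliers are masked.
    """
    n = len(log_ratios)
    squared = [x * x for x in log_ratios]
    # Visit the most extreme tokens first.
    order = sorted(range(n), key=lambda i: squared[i], reverse=True)

    keep = [True] * n
    total = sum(squared)
    kept = n
    for i in order:
        # kept >= 1 here, because at most one token is dropped per iteration.
        if total / kept <= threshold:
            break
        keep[i] = False
        total -= squared[i]
        kept -= 1
    return keep


# Hypothetical per-token log ratios log(pi_new / pi_old) for a stale batch:
# three near-on-policy tokens and one extreme outlier.
example_log_ratios = [0.01, -0.02, 0.03, 2.0]
example_mask = second_moment_mask(example_log_ratios, threshold=0.04)
```

In this toy batch only the outlier token is masked, while the three tokens with small log-ratios still contribute to the update. A per-token clip on the ratio would instead treat each token independently of the rest of the batch; the batch-level constraint sketched here is what lets small deviations pass untouched.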
