Prosperity before Collapse: How Far Can Off-Policy RL Reach with Stale Data on LLMs?
October 1, 2025
Authors: Haizhong Zheng, Jiawei Zhao, Bedi Chen
cs.AI
Abstract
Reinforcement learning has been central to recent advances in large language
model reasoning, but most algorithms rely on on-policy training that demands
fresh rollouts at every update, limiting efficiency and scalability.
Asynchronous RL systems alleviate this by decoupling rollout generation from
training, yet their effectiveness hinges on tolerating large staleness in
rollout data, a setting where existing methods either degrade in performance or
collapse. We revisit this challenge and uncover a prosperity-before-collapse
phenomenon: stale data can be as informative as on-policy data if exploited
properly. Building on this insight, we introduce M2PO (Second-Moment Trust
Policy Optimization), which constrains the second moment of importance weights
to suppress only extreme outliers while preserving informative updates.
Notably, M2PO sharply reduces the fraction of clipped tokens under high
staleness (from 1.22% to 0.06% over training), precisely masking high-variance
tokens while maintaining stable optimization. Extensive evaluation across six
models (from 1.7B to 32B) and eight benchmarks shows that M2PO delivers stable
off-policy training even with data stale by at least 256 model updates and
matches on-policy performance.
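To make the abstract's description concrete, below is a minimal Python (PyTorch) sketch of a second-moment trust constraint on token-level importance weights: only the most extreme tokens are masked until the second moment of the weight deviations falls under a bound, and every surviving token keeps its unclipped importance weight. The function name `second_moment_trust_mask`, the bound `m2_max`, and the greedy masking rule are illustrative assumptions for this sketch, not the authors' exact M2PO formulation.

```python
import torch

def second_moment_trust_mask(logp_new: torch.Tensor,
                             logp_old: torch.Tensor,
                             m2_max: float = 0.04) -> torch.Tensor:
    """Return a boolean mask over tokens (True = keep the token's update).

    logp_new: log-probabilities of the sampled tokens under the current policy.
    logp_old: log-probabilities under the (possibly stale) behavior policy.
    m2_max:   assumed bound on the second moment of (w - 1), where
              w = exp(logp_new - logp_old); the exact constraint in M2PO
              may differ.
    """
    w = torch.exp(logp_new - logp_old)        # per-token importance weights
    dev2 = (w - 1.0) ** 2                     # squared deviation from on-policy
    keep = torch.ones_like(w, dtype=torch.bool)

    # Greedily mask the highest-variance tokens until the constraint holds,
    # leaving all remaining tokens' importance weights untouched (no clipping).
    for idx in torch.argsort(dev2, descending=True):
        if dev2[keep].mean() <= m2_max:
            break
        keep[idx] = False
    return keep

# Usage sketch: drop only the extreme-outlier tokens from stale rollouts, then
# apply an otherwise unclipped importance-weighted policy-gradient loss.
logp_new = torch.randn(256) * 0.1
logp_old = torch.randn(256) * 0.1
advantage = torch.randn(256)
mask = second_moment_trust_mask(logp_new, logp_old)
w = torch.exp(logp_new - logp_old)
loss = -(w * advantage)[mask].mean()
```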