Prosperity before Collapse: How Far Can Off-Policy RL Reach with Stale Data on LLMs?

October 1, 2025
Authors: Haizhong Zheng, Jiawei Zhao, Beidi Chen
cs.AI

Abstract

Reinforcement learning has been central to recent advances in large language model reasoning, but most algorithms rely on on-policy training that demands fresh rollouts at every update, limiting efficiency and scalability. Asynchronous RL systems alleviate this by decoupling rollout generation from training, yet their effectiveness hinges on tolerating large staleness in rollout data, a setting where existing methods either degrade in performance or collapse. We revisit this challenge and uncover a prosperity-before-collapse phenomenon: stale data can be as informative as on-policy data if exploited properly. Building on this insight, we introduce M2PO (Second-Moment Trust Policy Optimization), which constrains the second moment of importance weights to suppress only extreme outliers while preserving informative updates. Notably, M2PO sharply reduces the fraction of clipped tokens under high staleness (from 1.22% to 0.06% over training), precisely masking high-variance tokens while maintaining stable optimization. Extensive evaluation across six models (from 1.7B to 32B) and eight benchmarks shows that M2PO delivers stable off-policy training even with data stale by at least 256 model updates and matches on-policy performance.
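The abstract only names M2PO's core mechanism: constrain the second moment of the per-token importance weights so that only extreme outliers are masked while informative updates are kept. The sketch below is a rough illustration of that idea under stated assumptions, not the paper's exact criterion: it greedily masks the tokens whose importance ratio deviates most from 1 until the remaining second moment falls below a threshold. The function name `m2_trust_mask` and the threshold `tau` are placeholders introduced here for illustration.

```python
# Hypothetical sketch of a second-moment trust mask over importance weights.
# The exact constraint and masking rule in M2PO may differ from this illustration.
import numpy as np

def m2_trust_mask(logp_new: np.ndarray, logp_old: np.ndarray, tau: float = 0.04) -> np.ndarray:
    """Return a boolean mask keeping tokens whose importance weights
    satisfy a second-moment constraint; only extreme outliers are dropped."""
    ratio = np.exp(logp_new - logp_old)      # per-token importance weights
    dev = (ratio - 1.0) ** 2                 # squared deviation from on-policy (ratio == 1)
    mask = np.ones_like(ratio, dtype=bool)

    # Greedily drop the highest-variance tokens until the second moment of the
    # remaining tokens falls below the trust threshold tau.
    for idx in np.argsort(-dev):
        if dev[mask].mean() <= tau:
            break
        mask[idx] = False
    return mask

# Usage: keep only trusted tokens when forming the policy-gradient loss.
logp_new = np.log(np.array([1.0, 1.1, 0.9, 5.0]))
logp_old = np.zeros(4)                       # old policy assigns probability 1 to each token here
keep = m2_trust_mask(logp_new, logp_old)
print(keep)  # the extreme ratio (5.0) is masked; the near-on-policy tokens survive
```

The point of the example is the contrast with hard ratio clipping: instead of clipping every token outside a fixed band, a second-moment constraint removes only the few tokens that dominate the variance, which is consistent with the abstract's report that the fraction of clipped tokens drops from 1.22% to 0.06% under high staleness.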