

BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping

October 21, 2025
作者: Zhiheng Xi, Xin Guo, Yang Nan, Enyu Zhou, Junrui Shen, Wenxiang Chen, Jiaqi Liu, Jixuan Huang, Zhihao Zhang, Honglin Guo, Xun Deng, Zhikai Lei, Miao Zheng, Guoteng Wang, Shuo Zhang, Peng Sun, Rui Zheng, Hang Yan, Tao Gui, Qi Zhang, Xuanjing Huang
cs.AI

Abstract

Reinforcement learning (RL) has recently become the core paradigm for aligning and strengthening large language models (LLMs). Yet applying RL in off-policy settings--where stale data from past policies are used for training--improves sample efficiency but remains challenging: policy entropy declines sharply, and optimization often becomes unstable and may even collapse. Through theoretical and empirical analysis, we identify two key insights: (i) an imbalance in optimization, where negative-advantage samples dominate the policy gradient, suppressing useful behaviors and risking gradient explosions; and (ii) the derived Entropy-Clip Rule, which reveals that the fixed clipping mechanism in PPO-like objectives systematically blocks entropy-increasing updates, thereby driving the policy toward over-exploitation at the expense of exploration. Building on these insights, we propose BAlanced Policy Optimization with Adaptive Clipping (BAPO), a simple yet effective method that dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. Across diverse off-policy scenarios--including sample replay and partial rollout--BAPO achieves fast, stable, and data-efficient training. On the AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B, while our 32B BAPO model not only achieves state-of-the-art results among models of the same scale but also outperforms leading proprietary systems such as o3-mini and Gemini-2.5-Flash-Thinking.
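To make the mechanism described in the abstract concrete, the following is a minimal sketch of a PPO-like clipped surrogate whose clipping bounds are adjusted per batch to re-balance the contributions of positive- and negative-advantage tokens. This is an illustration under assumptions, not the paper's implementation: the balancing target `rho`, the adjustment step, and the choice to adapt only the upper bound are hypothetical placeholders for whatever rule BAPO actually uses.

```python
# Sketch of adaptive-clipping policy optimization in the spirit of BAPO.
# Hypothetical hyperparameters (rho, step) and the per-batch adaptation rule
# are assumptions for exposition, not the authors' exact formulation.
import torch

def clipped_surrogate(logp_new, logp_old, adv, clip_low, clip_high):
    """PPO-style clipped surrogate with separate lower/upper clipping bounds."""
    ratio = torch.exp(logp_new - logp_old)                       # importance ratio r_t
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high) * adv
    return torch.min(unclipped, clipped)                         # per-token objective (maximized)

def bapo_like_loss(logp_new, logp_old, adv,
                   clip_low=0.2, clip_high=0.28, rho=0.5, step=0.01):
    """Widen or narrow the upper bound so positive-advantage tokens keep a
    target share `rho` of the surrogate's (absolute) mass."""
    with torch.no_grad():
        pos = adv > 0
        obj = clipped_surrogate(logp_new, logp_old, adv, clip_low, clip_high)
        pos_mass = obj[pos].abs().sum()
        neg_mass = obj[~pos].abs().sum() + 1e-8
        # If negative-advantage samples dominate, loosen the upper bound to
        # let positive samples contribute more; otherwise tighten it again.
        if pos_mass / (pos_mass + neg_mass) < rho:
            clip_high = clip_high + step
        else:
            clip_high = max(clip_high - step, clip_low)
    obj = clipped_surrogate(logp_new, logp_old, adv, clip_low, clip_high)
    return -obj.mean()                                           # negate: optimizers minimize
```

In a full training setup the adapted bounds would presumably persist and evolve across optimization steps rather than being recomputed from scratch for each batch; see the paper for the precise adaptation rule and its entropy-preserving guarantees.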