
BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping

October 21, 2025
作者: Zhiheng Xi, Xin Guo, Yang Nan, Enyu Zhou, Junrui Shen, Wenxiang Chen, Jiaqi Liu, Jixuan Huang, Zhihao Zhang, Honglin Guo, Xun Deng, Zhikai Lei, Miao Zheng, Guoteng Wang, Shuo Zhang, Peng Sun, Rui Zheng, Hang Yan, Tao Gui, Qi Zhang, Xuanjing Huang
cs.AI

Abstract

Reinforcement learning (RL) has recently become the core paradigm for aligning and strengthening large language models (LLMs). Yet applying RL in off-policy settings--where stale data from past policies are reused for training--improves sample efficiency but remains challenging: policy entropy declines sharply, and optimization often becomes unstable or even collapses. Through theoretical and empirical analysis, we identify two key insights: (i) an imbalance in optimization, where negative-advantage samples dominate the policy gradient, suppressing useful behaviors and risking gradient explosions; and (ii) the derived Entropy-Clip Rule, which reveals that the fixed clipping mechanism in PPO-like objectives systematically blocks entropy-increasing updates, thereby driving the policy toward over-exploitation at the expense of exploration. Building on these insights, we propose BAlanced Policy Optimization with Adaptive Clipping (BAPO), a simple yet effective method that dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. Across diverse off-policy scenarios--including sample replay and partial rollout--BAPO achieves fast, stable, and data-efficient training. On the AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B, while our 32B BAPO model not only achieves state-of-the-art results among models of the same scale but also outperforms leading proprietary systems such as o3-mini and Gemini-2.5-Flash-Thinking.
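To make the adaptive-clipping idea concrete, below is a minimal, hypothetical sketch (not the authors' released implementation) of a PPO-style clipped surrogate in which the lower and upper clip bounds are separate, adjustable parameters. The function name `clipped_surrogate_loss` and the specific bound values are illustrative assumptions; BAPO's actual rule for adapting the bounds is described in the paper itself.

```python
# Illustrative sketch only: a PPO-like token-level objective with separately
# tunable clip bounds c_low / c_high. Adapting these bounds per batch is one
# way to re-balance positive- and negative-advantage contributions and avoid
# systematically blocking entropy-increasing updates.
import torch


def clipped_surrogate_loss(logp_new, logp_old, advantages, c_low=0.2, c_high=0.2):
    """Clipped surrogate with asymmetric (adjustable) clipping bounds.

    logp_new, logp_old: log-probabilities of sampled tokens under the current
        and behavior policies; advantages: per-token advantage estimates.
    """
    ratio = torch.exp(logp_new - logp_old)                # importance ratio r_t
    clipped = torch.clamp(ratio, 1.0 - c_low, 1.0 + c_high)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)
    return -surrogate.mean()                              # minimize the negative


if __name__ == "__main__":
    torch.manual_seed(0)
    logp_old = torch.randn(8)
    logp_new = logp_old + 0.1 * torch.randn(8)
    adv = torch.randn(8)
    # A wider upper bound lets low-probability tokens with positive advantage
    # contribute more to the gradient, which tends to preserve entropy.
    print(clipped_surrogate_loss(logp_new, logp_old, adv, c_low=0.2, c_high=0.4))
```

The sketch only shows where adjustable bounds enter a PPO-like objective; how BAPO chooses them dynamically is the paper's contribution.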