BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning
March 5, 2026
Authors: Yuan Li, Bo Wang, Yufei Gao, Yuqian Yao, Xinyuan Wang, Zhangyue Yin, Xipeng Qiu
cs.AI
Abstract
Proximal constraints are fundamental to the stability of Large Language Model reinforcement learning. While the canonical clipping mechanism in PPO serves as an efficient surrogate for trust regions, we identify a critical bottleneck: fixed bounds strictly constrain the upward update margin of low-probability actions, disproportionately suppressing high-advantage tail strategies and inducing rapid entropy collapse. To address this, we introduce Band-constrained Policy Optimization (BandPO). BandPO replaces canonical clipping with Band, a theoretically unified operator that projects trust regions defined by f-divergences into dynamic, probability-aware clipping intervals. Theoretical analysis confirms that Band effectively resolves this exploration bottleneck. We formulate this mapping as a convex optimization problem, guaranteeing a globally optimal numerical solution while deriving closed-form solutions for specific divergences. Extensive experiments across diverse models and datasets demonstrate that BandPO consistently outperforms canonical clipping and Clip-Higher, while robustly mitigating entropy collapse.
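The abstract does not spell out the Band operator, so the following is only a minimal sketch of the idea it describes: PPO's fixed clipping interval contrasted with a hypothetical probability-aware band whose width depends on the old policy's token probability. The band_bounds function, the delta trust-region radius, and the per-token chi-square-style budget behind it are all illustrative assumptions, not BandPO's actual convex program or closed-form solutions.

```python
import torch

def ppo_clipped_surrogate(ratio, advantage, eps=0.2):
    """Canonical PPO surrogate: a fixed interval [1 - eps, 1 + eps],
    regardless of how likely the action was under the old policy."""
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # Pessimistic (element-wise min) objective, as in standard PPO.
    return torch.min(ratio * advantage, clipped * advantage)

def band_bounds(old_prob, delta=0.05):
    """Hypothetical probability-aware bounds (illustrative only, not the
    paper's derivation). A per-token chi-square-style budget
        old_prob * (ratio - 1)^2 <= delta
    gives |ratio - 1| <= sqrt(delta / old_prob), so low-probability
    tokens get a wider upward margin -- the exploration headroom the
    abstract argues fixed clipping suppresses."""
    half_width = torch.sqrt(delta / old_prob.clamp_min(1e-8))
    lower = (1.0 - half_width).clamp_min(0.0)  # ratios cannot be negative
    upper = 1.0 + half_width                   # wider for rarer tokens
    return lower, upper

def band_surrogate(ratio, advantage, old_prob, delta=0.05):
    """PPO-style surrogate with the dynamic band replacing fixed clipping."""
    lower, upper = band_bounds(old_prob, delta)
    clipped = torch.clamp(ratio, lower, upper)
    return torch.min(ratio * advantage, clipped * advantage)
```

Under these toy bounds with delta = 0.05, a token with old probability 0.01 could move up to a ratio of roughly 1 + sqrt(5) ≈ 3.24, where fixed clipping with eps = 0.2 caps it at 1.2, which illustrates the extra upward headroom for low-probability actions that the abstract attributes to Band.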