BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning

March 5, 2026
作者: Yuan Li, Bo Wang, Yufei Gao, Yuqian Yao, Xinyuan Wang, Zhangyue Yin, Xipeng Qiu
cs.AI

Abstract

Proximal constraints are fundamental to the stability of reinforcement learning for Large Language Models. While the canonical clipping mechanism in PPO serves as an efficient surrogate for trust regions, we identify a critical bottleneck: fixed bounds strictly constrain the upward update margin of low-probability actions, disproportionately suppressing high-advantage tail strategies and inducing rapid entropy collapse. To address this, we introduce Band-constrained Policy Optimization (BandPO). BandPO replaces canonical clipping with Band, a unified theoretical operator that projects trust regions defined by f-divergences into dynamic, probability-aware clipping intervals. Theoretical analysis confirms that Band effectively resolves this exploration bottleneck. We formulate this mapping as a convex optimization problem, guaranteeing a globally optimal numerical solution while also deriving closed-form solutions for specific divergences. Extensive experiments across diverse models and datasets demonstrate that BandPO consistently outperforms canonical clipping and Clip-Higher while robustly mitigating entropy collapse.
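
To make the motivating bottleneck concrete, the sketch below contrasts PPO's fixed ratio interval with a probability-aware interval obtained by bounding the probability mass an update may move on each action, |p_new − p_old| ≤ δ. The function names and the mass-bound constraint are illustrative assumptions, not the paper's Band operator or its f-divergence projection; the sketch only shows how divergence-style constraints naturally yield per-action bounds whose upward margin widens for low-probability (tail) actions.

```python
import numpy as np

def fixed_clip_bounds(eps=0.2):
    # Canonical PPO clipping: one static ratio interval for every action,
    # independent of the action's old probability.
    return 1.0 - eps, 1.0 + eps

def probability_aware_clip_bounds(p_old, delta=0.05):
    # Hypothetical probability-aware interval (NOT BandPO's Band operator):
    # bound the probability mass an update may move on each action,
    # |p_new - p_old| <= delta. In ratio space (r = p_new / p_old) this
    # gives per-action bounds whose upward margin grows as p_old shrinks.
    lo = np.maximum(0.0, 1.0 - delta / p_old)
    hi = 1.0 + delta / p_old
    return lo, hi

def clipped_surrogate(ratio, advantage, lo, hi):
    # PPO-style pessimistic surrogate, accepting per-action bounds.
    return np.minimum(ratio * advantage, np.clip(ratio, lo, hi) * advantage)

p_old = np.array([0.9, 0.1, 0.01])  # head, mid, and tail actions
print("fixed  :", fixed_clip_bounds())  # (0.8, 1.2) for every action
lo, hi = probability_aware_clip_bounds(p_old)
print("dynamic:", list(zip(lo.round(3), hi.round(3))))
# -> the 0.01-probability action may grow up to 6x (hi = 1 + 0.05/0.01),
#    while the absolute probability it moves stays bounded by delta.
```

Under fixed clipping, the tail action's ratio is capped at 1 + ε no matter how small p_old is, which is exactly the upward-margin restriction the abstract identifies; a probability-aware interval relaxes that cap while still limiting how much probability mass any single update can move.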