BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning

Abstract

Proximal constraints are fundamental to the stability of reinforcement learning for large language models. While the canonical clipping mechanism in PPO serves as an efficient surrogate for trust regions, we identify a critical bottleneck: fixed bounds strictly constrain the upward update margin of low-probability actions, disproportionately suppressing high-advantage tail strategies and inducing rapid entropy collapse. To address this, we introduce Band-constrained Policy Optimization (BandPO). BandPO replaces canonical clipping with Band, a unified theoretical operator that projects trust regions defined by f-divergences into dynamic, probability-aware clipping intervals. Theoretical analysis confirms that Band effectively resolves this exploration bottleneck. We formulate this mapping as a convex optimization problem, which guarantees a globally optimal numerical solution, and we derive closed-form solutions for specific divergences. Extensive experiments across diverse models and datasets demonstrate that BandPO consistently outperforms canonical clipping and Clip-Higher while robustly mitigating entropy collapse.
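
The bottleneck described in the abstract can be illustrated numerically. The sketch below is not the paper's Band operator. It assumes a KL trust region of the form KL(pi_new || pi_old) <= delta, reduced to a single action versus all other actions (a two-outcome simplification), and solves for the largest reachable probability by bisection. The function names and the values eps = 0.2 and delta = 0.02 are hypothetical and chosen only for illustration.

```python
import math


def clip_upper_prob(p_old, eps=0.2):
    """Largest new probability allowed by fixed ratio clipping r <= 1 + eps."""
    return min(1.0, p_old * (1.0 + eps))


def binary_kl(q, p):
    """KL(Bernoulli(q) || Bernoulli(p)) in nats, treating 0 * log(0) as 0.

    Assumes 0 < p < 1 (the old probability) and 0 <= q <= 1.
    """
    total = 0.0
    if q > 0.0:
        total += q * math.log(q / p)
    if q < 1.0:
        total += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return total


def kl_upper_prob(p_old, delta, iters=100):
    """Largest q >= p_old with binary_kl(q, p_old) <= delta.

    binary_kl(q, p_old) is increasing in q on [p_old, 1], so bisection applies.
    """
    lo, hi = p_old, 1.0
    if binary_kl(hi, p_old) <= delta:
        return 1.0  # the whole interval is inside the trust region
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if binary_kl(mid, p_old) <= delta:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    eps, delta = 0.2, 0.02  # hypothetical values, not taken from the paper
    for p_old in (0.5, 0.1, 0.01):
        q_clip = clip_upper_prob(p_old, eps)
        q_kl = kl_upper_prob(p_old, delta)
        print(f"p_old={p_old:<5} clip ratio={q_clip / p_old:.2f} "
              f"KL-derived ratio={q_kl / p_old:.2f}")
```

Under these assumptions, fixed clipping allows the same ratio of 1 + eps at every probability, so the absolute headroom shrinks in proportion to p_old. The divergence-derived bound allows a larger ratio as p_old gets smaller, which is the probability-aware behavior the abstract attributes to Band.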
