BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning

Abstract

Proximal constraints are fundamental to the stability of reinforcement learning for large language models. While the canonical clipping mechanism in PPO serves as an efficient surrogate for trust regions, we identify a critical bottleneck: fixed bounds strictly constrain the upward update margin of low-probability actions, disproportionately suppressing high-advantage tail strategies and inducing rapid entropy collapse. To address this, we introduce Band-constrained Policy Optimization (BandPO). BandPO replaces canonical clipping with Band, a unified theoretical operator that projects trust regions defined by f-divergences into dynamic, probability-aware clipping intervals. Theoretical analysis confirms that Band effectively resolves this exploration bottleneck. We formulate this mapping as a convex optimization problem, which guarantees a globally optimal numerical solution, and we derive closed-form solutions for specific divergences. Extensive experiments across diverse models and datasets demonstrate that BandPO consistently outperforms canonical clipping and Clip-Higher while robustly mitigating entropy collapse.
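The bottleneck described above can be illustrated with a small numerical sketch. With a fixed clip range of 0.2, an action whose old probability is 0.01 can rise to at most 0.012 in one update, whereas a divergence budget on the same action would permit a much larger ratio. The code below is an assumption-laden illustration and not the Band operator from the paper: it collapses the policy to a two-outcome distribution (the action versus everything else), uses the KL divergence KL(old || new) as the f-divergence, and finds the ratio interval by bisection. The names `binary_kl`, `kl_ratio_interval`, and the budget `delta` are hypothetical.

```python
import math
from typing import Tuple


def binary_kl(p: float, q: float) -> float:
    """KL(Bernoulli(p) || Bernoulli(q)) for p, q strictly inside (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def fixed_clip_interval(eps: float = 0.2) -> Tuple[float, float]:
    """Canonical PPO-style ratio interval: the same for every probability."""
    return 1.0 - eps, 1.0 + eps


def kl_ratio_interval(p: float, delta: float, iters: int = 100) -> Tuple[float, float]:
    """Ratio interval [r_lo, r_hi] such that binary_kl(p, r * p) <= delta.

    Illustrative assumption: the trust region is measured only on the
    two-outcome split {action, rest}; the paper's operator may differ.
    """
    def solve(feasible: float, infeasible: float) -> float:
        # Bisection: `feasible` always satisfies the KL budget.
        for _ in range(iters):
            mid = 0.5 * (feasible + infeasible)
            if binary_kl(p, mid) <= delta:
                feasible = mid
            else:
                infeasible = mid
        return feasible

    tiny = 1e-12
    q_hi = solve(p, 1.0 - tiny)  # largest admissible new probability
    q_lo = solve(p, tiny)        # smallest admissible new probability
    return q_lo / p, q_hi / p


if __name__ == "__main__":
    delta = 0.01  # hypothetical divergence budget
    print("fixed clip:", fixed_clip_interval())
    for p in (0.01, 0.1, 0.9):
        lo, hi = kl_ratio_interval(p, delta)
        print(f"p={p}: ratio interval = [{lo:.3f}, {hi:.3f}]")
```

Under these assumptions the admissible upper ratio widens as the old probability shrinks, which is the qualitative behavior the abstract attributes to probability-aware bounds; a fixed interval gives every action the same relative margin regardless of its probability.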

