MHPO: 安定強化学習のための変調型ハザード認識方策最適化

要旨

重要度比率の制御は、Group Relative Policy Optimization（GRPO）に基づくフレームワークの学習安定性において極めて重要である。しかし、ハードクリッピングのような従来の比率制御手法は、非微分可能な境界と勾配消失領域を有しており、勾配の忠実性を維持できないという問題を抱えている。さらに、これらの手法は極端な偏差を適応的に抑制するハザード認識メカニズムを備えておらず、急激な方策シフトに対して最適化プロセスが脆弱になる。これらの課題に対処するため、我々はロバストで安定した強化学習を実現する新しいフレームワークであるModulated Hazard-aware Policy Optimization（MHPO）を提案する。提案するMHPOは、Log-Fidelity Modulator（LFM）を導入し、有界ではない重要度比率を有界かつ微分可能な領域に写像する。この機構は、高分散の外れ値トークンが損失ランドスケープを不安定化することを効果的に防止するとともに、大域的な勾配安定性を保証する。相補的に、Decoupled Hazard Penalty（DHP）は、生存時間解析からの累積ハザード関数を統合し、正負の方策シフトを独立して制御する。ハザード認識ペナルティによって最適化ランドスケープを形成することにより、提案するMHPOは非対称な方策シフトの細粒度な制御を実現し、過剰な拡張によるモード崩壊を軽減するとともに、安定化された信頼領域内での壊滅的な収縮による方策の劣化を防止する。テキストベース及び視覚言語タスクにわたる多様な推論ベンチマークでの広範な評価により、MHPOが既存手法を一貫して上回り、優れた性能を達成するとともに学習安定性を大幅に向上させることを実証した。

English

Regulating the importance ratio is critical for the training stability of Group Relative Policy Optimization (GRPO) based frameworks. However, prevailing ratio control methods, such as hard clipping, suffer from non-differentiable boundaries and vanishing gradient regions, failing to maintain gradient fidelity. Furthermore, these methods lack a hazard-aware mechanism to adaptively suppress extreme deviations, leaving the optimization process vulnerable to abrupt policy shifts. To address these challenges, we propose Modulated Hazard-aware Policy Optimization (MHPO), a novel framework designed for robust and stable reinforcement learning. The proposed MHPO introduces a Log-Fidelity Modulator (LFM) to map unbounded importance ratios into a bounded, differentiable domain. This mechanism effectively prevents high-variance outlier tokens from destabilizing the loss landscape while ensuring global gradient stability. Complementarily, a Decoupled Hazard Penalty (DHP) integrates cumulative hazard functions from survival analysis to independently regulate positive and negative policy shifts. By shaping the optimization landscape with hazard-aware penalties, the proposed MHPO achieves fine-grained regulation of asymmetric policy shifts simultaneously mitigating mode collapse from over-expansion and preventing policy erosion from catastrophic contraction within a stabilized trust region. Extensive evaluations on diverse reasoning benchmarks across both text-based and vision-language tasks demonstrate that MHPO consistently outperforms existing methods, achieving superior performance while significantly enhancing training stability.

MHPO: 安定強化学習のための変調型ハザード認識方策最適化

MHPO: Modulated Hazard-aware Policy Optimization for Stable Reinforcement Learning

要旨

Support