

MHPO: Modulated Hazard-aware Policy Optimization for Stable Reinforcement Learning

March 14, 2026
Authors: Hongjun Wang, Wei Liu, Weibo Gu, Xing Sun, Kai Han
cs.AI

Abstract

Regulating the importance ratio is critical to the training stability of Group Relative Policy Optimization (GRPO)-based frameworks. However, prevailing ratio-control methods such as hard clipping suffer from non-differentiable boundaries and vanishing-gradient regions, failing to maintain gradient fidelity. Furthermore, these methods lack a hazard-aware mechanism to adaptively suppress extreme deviations, leaving the optimization process vulnerable to abrupt policy shifts. To address these challenges, we propose Modulated Hazard-aware Policy Optimization (MHPO), a novel framework for robust and stable reinforcement learning. MHPO introduces a Log-Fidelity Modulator (LFM) that maps unbounded importance ratios into a bounded, differentiable domain, preventing high-variance outlier tokens from destabilizing the loss landscape while ensuring global gradient stability. Complementarily, a Decoupled Hazard Penalty (DHP) integrates cumulative hazard functions from survival analysis to independently regulate positive and negative policy shifts. By shaping the optimization landscape with hazard-aware penalties, MHPO achieves fine-grained regulation of asymmetric policy shifts, simultaneously mitigating mode collapse from over-expansion and preventing policy erosion from catastrophic contraction within a stabilized trust region. Extensive evaluations on diverse reasoning benchmarks spanning text-based and vision-language tasks show that MHPO consistently outperforms existing methods, achieving superior performance while significantly enhancing training stability.
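To make the two mechanisms concrete, the following is a minimal per-token sketch, not the authors' published formulation. It assumes a tanh of the log importance ratio as a stand-in for the Log-Fidelity Modulator (any smooth bounded map would do), and a convex cumulative-hazard-style term exp(x) − 1 − x applied separately to positive (expansion) and negative (contraction) log-ratio deviations as a stand-in for the Decoupled Hazard Penalty; the function names and coefficients are hypothetical.

```python
import math

def lfm(log_ratio: float) -> float:
    """Stand-in Log-Fidelity Modulator: maps the unbounded log importance
    ratio into (-1, 1) with finite gradients everywhere (assumed tanh)."""
    return math.tanh(log_ratio)

def dhp(log_ratio: float, hazard_pos: float = 1.0, hazard_neg: float = 1.0) -> float:
    """Stand-in Decoupled Hazard Penalty: separate convex penalties for
    positive and negative deviations, zero when the ratio is exactly 1
    (log_ratio == 0), growing rapidly for large deviations."""
    pos = max(log_ratio, 0.0)   # over-expansion side (ratio > 1)
    neg = max(-log_ratio, 0.0)  # catastrophic-contraction side (ratio < 1)
    penalize = lambda x: math.exp(x) - 1.0 - x  # cumulative-hazard-like, = 0 at x = 0
    return hazard_pos * penalize(pos) + hazard_neg * penalize(neg)

def mhpo_token_objective(logp_new: float, logp_old: float, advantage: float) -> float:
    """Per-token objective to be maximized: modulated advantage-weighted
    term minus the decoupled hazard penalty."""
    log_ratio = logp_new - logp_old
    return lfm(log_ratio) * advantage - dhp(log_ratio)
```

Unlike hard clipping, this surrogate keeps a nonzero gradient for every value of the ratio, while the asymmetric `hazard_pos` / `hazard_neg` coefficients let expansion and contraction be damped independently.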