MHPO: Modulated Hazard-aware Policy Optimization for Stable Reinforcement Learning
March 14, 2026
Authors: Hongjun Wang, Wei Liu, Weibo Gu, Xing Sun, Kai Han
cs.AI
Abstract
Regulating the importance ratio is critical for the training stability of Group Relative Policy Optimization (GRPO) based frameworks. However, prevailing ratio control methods, such as hard clipping, suffer from non-differentiable boundaries and vanishing-gradient regions, and thus fail to maintain gradient fidelity. Furthermore, these methods lack a hazard-aware mechanism to adaptively suppress extreme deviations, leaving the optimization process vulnerable to abrupt policy shifts. To address these challenges, we propose Modulated Hazard-aware Policy Optimization (MHPO), a novel framework designed for robust and stable reinforcement learning. MHPO introduces a Log-Fidelity Modulator (LFM) that maps unbounded importance ratios into a bounded, differentiable domain. This mechanism effectively prevents high-variance outlier tokens from destabilizing the loss landscape while ensuring global gradient stability. Complementarily, a Decoupled Hazard Penalty (DHP) integrates cumulative hazard functions from survival analysis to independently regulate positive and negative policy shifts. By shaping the optimization landscape with hazard-aware penalties, MHPO achieves fine-grained regulation of asymmetric policy shifts, simultaneously mitigating mode collapse from over-expansion and preventing policy erosion from catastrophic contraction within a stabilized trust region. Extensive evaluations on diverse reasoning benchmarks across both text-based and vision-language tasks demonstrate that MHPO consistently outperforms existing methods, achieving superior performance while significantly enhancing training stability.
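To make the contrast with hard clipping concrete, the sketch below gives one plausible reading of the abstract's two mechanisms, not the paper's actual objective: the tanh-based bounded modulation standing in for the LFM, the exponential-style cumulative hazard terms standing in for the DHP, and all function names and default hyperparameters are illustrative assumptions.

```python
# Minimal sketch (illustrative only, not the official MHPO implementation).
# Assumptions: the LFM is approximated by a tanh of the log-ratio, and the DHP
# by linear cumulative-hazard penalties on positive/negative log-ratio shifts.
import torch


def grpo_clip_objective(log_ratio, advantage, eps=0.2):
    """Standard GRPO/PPO-style hard clipping, shown for contrast.
    Gradients vanish wherever the ratio is pushed past the clip boundary."""
    ratio = log_ratio.exp()
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return torch.minimum(ratio * advantage, clipped * advantage).mean()


def modulated_hazard_objective(log_ratio, advantage,
                               scale=1.0,     # assumed LFM temperature
                               lam_pos=0.5,   # assumed hazard rate for over-expansion
                               lam_neg=0.5):  # assumed hazard rate for contraction
    """Illustrative stand-in for the LFM + DHP idea described in the abstract:
    (1) map the unbounded importance ratio into a bounded, everywhere-differentiable
        range so every token keeps a finite, nonzero gradient;
    (2) add decoupled penalties that grow with the cumulative hazard of positive and
        negative policy shifts (here a simple exponential hazard, Lambda(t) = lam * t)."""
    # (1) LFM-style bounded, smooth surrogate for the ratio; lies in (0, 2).
    modulated = 1.0 + torch.tanh(log_ratio / scale)

    # (2) DHP-style asymmetric regularizers on each shift direction.
    pos_shift = torch.relu(log_ratio)    # over-expansion: ratio > 1
    neg_shift = torch.relu(-log_ratio)   # contraction:    ratio < 1
    hazard_penalty = lam_pos * pos_shift + lam_neg * neg_shift

    return (modulated * advantage - hazard_penalty).mean()
```

In a GRPO-style loop, `log_ratio` would be the per-token difference of new and old log-probabilities and `advantage` the group-relative normalized reward; decoupling `lam_pos` from `lam_neg` is what lets the two failure modes named in the abstract (over-expansion versus catastrophic contraction) be penalized at different rates.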