SAFE: Stable Alignment Finetuning with Entropy-Aware Predictive Control for RLHF
February 4, 2026
Author: Dipan Maity
cs.AI
Abstract
Proximal Policy Optimization (PPO) has been positioned by recent literature as the canonical method for the RL stage of RLHF. PPO performs well empirically, but its motivation is heuristic, it handles the KL-divergence constraint used in LM-RLHF in an ad-hoc manner, and it suffers from reward oscillations, entropy collapse, value function drift, and sudden policy divergence that require frequent restarts and extensive hyperparameter tuning. In this paper, we develop a new pure on-policy actor-critic RL method for the LM-RLHF setting. We present SAFE (Stable Alignment Finetuning with Entropy-aware control), a novel RLHF algorithm that combines a Double Soft-Min Critic for pessimistic value estimation with a multi-layer stabilization framework coupling entropy-gated KL regulation with PID-controlled adaptive thresholds. Unlike standard PPO's symmetric KL penalties, SAFE distinguishes high-entropy exploration from low-entropy mode collapse and adjusts penalties dynamically based on reward velocity. Experiments on a 3B-parameter model show that SAFE achieves a +5.15% higher training-average reward than PPO (0.725 vs. 0.689), negligible reward crashes, and superior KL control compared to PPO. Our method adds minimal computational overhead and provides an interpretable, crash-resistant RLHF framework that maintains aggressive learning speed while ensuring stable long-horizon optimization suitable for production deployment. Code is available at https://github.com/ryyzn9/SAFE.
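The abstract names two mechanisms without giving details here, so the following is only a minimal illustrative sketch of what they could look like: (i) a pessimistic value estimate formed as a soft-min over two critic heads, and (ii) a PID loop that adapts the KL penalty coefficient, gated by policy entropy and scaled by reward velocity. All function names, gains, and thresholds below are assumptions for exposition, not the released SAFE implementation.

```python
# Illustrative sketch only; names, gains, and thresholds are assumptions,
# not taken from the SAFE codebase.
import torch


def soft_min_value(v1: torch.Tensor, v2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Pessimistic value estimate: a soft-min over two critic heads.

    As tau -> 0 this approaches min(v1, v2); larger tau blends the two heads.
    """
    stacked = torch.stack([v1, v2], dim=0)
    weights = torch.softmax(-stacked / tau, dim=0)
    return (weights * stacked).sum(dim=0)


class PIDKLController:
    """Adapts the KL penalty coefficient toward a target KL with a PID loop."""

    def __init__(self, kl_target=0.05, kp=0.5, ki=0.01, kd=0.1,
                 beta_init=0.1, beta_min=1e-4, beta_max=10.0):
        self.kl_target = kl_target
        self.kp, self.ki, self.kd = kp, ki, kd
        self.beta = beta_init
        self.beta_min, self.beta_max = beta_min, beta_max
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, kl_observed, entropy, entropy_floor=1.0, reward_velocity=0.0):
        # PID step on the KL tracking error.
        error = kl_observed - self.kl_target
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        adjustment = self.kp * error + self.ki * self.integral + self.kd * derivative

        # Multiplicative update keeps the coefficient positive and bounded.
        self.beta = float(min(max(self.beta * (1.0 + adjustment),
                                  self.beta_min), self.beta_max))

        # Entropy gate: high entropy is treated as benign exploration and
        # penalized less; collapsing entropy keeps the full penalty.
        effective_beta = 0.5 * self.beta if entropy > entropy_floor else self.beta

        # Reward-velocity scaling: rapidly moving rewards suggest instability,
        # so the penalty is strengthened in proportion.
        return effective_beta * (1.0 + abs(reward_velocity))
```

In a training loop, one would presumably feed the measured per-update KL, the mean policy entropy, and a smoothed reward slope into `update()` to obtain that step's KL penalty coefficient; the actual schedule used by SAFE is described in the paper and repository.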