SAFE: Stable Alignment Finetuning with Entropy-Aware Predictive Control for RLHF
February 4, 2026
作者: Dipan Maity
cs.AI
Abstract
Proximal Policy Optimization (PPO) has been positioned by recent literature as the canonical method for the RL part of RLHF. PPO performs well empirically, but its heuristic motivation leads to an ad-hoc treatment of the KL-divergence constraint used in LM-RLHF, and it suffers from reward oscillations, entropy collapse, value function drift, and sudden policy divergence that require frequent restarts and extensive hyperparameter tuning. In this paper, we develop a new, purely on-policy actor-critic RL method for the LM-RLHF setting. We present SAFE (Stable Alignment Finetuning with Entropy-aware control), a novel RLHF algorithm that combines a Double Soft-Min Critic for pessimistic value estimation with a multi-layer stabilization framework built on entropy-gated KL regulation and PID-controlled adaptive thresholds. Unlike standard PPO's symmetric KL penalties, SAFE distinguishes high-entropy exploration from low-entropy mode collapse and adjusts penalties dynamically based on reward velocity. Experiments on a 3B-parameter model show that SAFE achieves a 5.15% higher training-average reward than PPO (0.725 vs 0.689), negligible reward crashes, and superior KL control compared to PPO. Our method adds minimal computational overhead and provides an interpretable, crash-resistant RLHF framework that maintains aggressive learning speed while ensuring stable long-horizon optimization suitable for production deployment. Code is available at https://github.com/ryyzn9/SAFE
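The sketch below is a minimal, hedged illustration of the mechanisms named in the abstract (a pessimistic double soft-min value estimate, and an entropy-gated KL penalty with a PID-controlled target scaled by reward velocity); it is not the authors' implementation, and all names and hyperparameters (SoftMinCritic, PIDKLController, tau, kp/ki/kd, entropy_floor) are illustrative assumptions.

```python
# Illustrative sketch only: assumed shapes of the ideas described in the abstract,
# not the SAFE codebase's API.
import torch
import torch.nn as nn


class SoftMinCritic(nn.Module):
    """Two independent value heads blended with a soft-min for pessimistic estimates."""

    def __init__(self, hidden_dim: int, tau: float = 1.0):
        super().__init__()
        self.v1 = nn.Linear(hidden_dim, 1)
        self.v2 = nn.Linear(hidden_dim, 1)
        self.tau = tau  # temperature: tau -> 0 recovers a hard min over the two heads

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        v = torch.cat([self.v1(h), self.v2(h)], dim=-1)           # (..., 2)
        # soft-min(v) = -tau * logsumexp(-v / tau): smooth, pessimistic combination
        return -self.tau * torch.logsumexp(-v / self.tau, dim=-1)


class PIDKLController:
    """PID loop on the KL deviation; the resulting penalty is gated by policy
    entropy and scaled by reward velocity (how quickly the reward is changing)."""

    def __init__(self, kl_target=0.05, kp=0.5, ki=0.01, kd=0.1,
                 entropy_floor=0.5, beta_init=0.1):
        self.kl_target, self.kp, self.ki, self.kd = kl_target, kp, ki, kd
        self.entropy_floor = entropy_floor
        self.beta = beta_init      # current KL penalty coefficient
        self.integral = 0.0
        self.prev_error = 0.0
        self.prev_reward = None

    def update(self, kl: float, entropy: float, mean_reward: float) -> float:
        # PID update on the deviation from the KL target
        error = kl - self.kl_target
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        self.beta = max(1e-4, self.beta + self.kp * error
                        + self.ki * self.integral + self.kd * derivative)

        # Entropy gate: a low-entropy policy (mode-collapse risk) is penalized
        # at full strength; a high-entropy, still-exploring policy is relaxed.
        gate = 1.0 if entropy < self.entropy_floor else 0.5

        # Reward velocity: fast reward swings increase the effective penalty.
        velocity = 0.0 if self.prev_reward is None else abs(mean_reward - self.prev_reward)
        self.prev_reward = mean_reward

        return self.beta * gate * (1.0 + velocity)


# Usage sketch: fold the adaptive coefficient into the policy loss each step,
# e.g. loss = policy_loss + beta * kl inside the RLHF training loop.
ctrl = PIDKLController()
beta = ctrl.update(kl=0.08, entropy=0.4, mean_reward=0.71)
print(f"adaptive KL coefficient: {beta:.4f}")
```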