SAFE: Stable Alignment Finetuning with Entropy-Aware Predictive Control for RLHF
February 4, 2026
Author: Dipan Maity
cs.AI
Abstract
Proximal Policy Optimization (PPO) has been positioned by recent literature as the canonical method for the RL stage of RLHF. PPO performs well empirically, but its motivation is heuristic, it handles the KL-divergence constraint used in LM-RLHF in an ad-hoc manner, and it suffers from reward oscillations, entropy collapse, value function drift, and sudden policy divergence that require frequent restarts and extensive hyperparameter tuning. In this paper, we develop a new pure on-policy actor-critic RL method for the LM-RLHF setting. We present SAFE (Stable Alignment Finetuning with Entropy-aware control), a novel RLHF algorithm that combines a Double Soft-Min Critic for pessimistic value estimation with a multi-layer stabilization framework coupling entropy-gated KL regulation with PID-controlled adaptive thresholds. Unlike standard PPO's symmetric KL penalties, SAFE distinguishes high-entropy exploration from low-entropy mode collapse and adjusts penalties dynamically based on reward velocity. Experiments on a 3B-parameter model show that SAFE achieves a +5.15% higher training-average reward than PPO (0.725 vs. 0.689), negligible reward crashes, and superior KL control compared to PPO. Our method adds minimal computational overhead and provides an interpretable, crash-resistant RLHF framework that maintains aggressive learning speed while ensuring stable long-horizon optimization suitable for production deployment. Code is available at https://github.com/ryyzn9/SAFE.
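The abstract names two mechanisms without giving details here, so the following is only a minimal illustrative sketch of what they could look like: (i) a pessimistic value estimate formed as a soft-min over two critic heads, and (ii) a PID loop that adapts the KL penalty coefficient, gated by policy entropy and scaled by reward velocity. All function names, gains, and thresholds below are assumptions for exposition, not the released SAFE implementation.

```python
# Illustrative sketch only; names, gains, and thresholds are assumptions,
# not taken from the SAFE codebase.
import torch


def soft_min_value(v1: torch.Tensor, v2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Pessimistic value estimate: a soft-min over two critic heads.

    As tau -> 0 this approaches min(v1, v2); larger tau blends the two heads.
    """
    stacked = torch.stack([v1, v2], dim=0)
    weights = torch.softmax(-stacked / tau, dim=0)
    return (weights * stacked).sum(dim=0)


class PIDKLController:
    """Adapts the KL penalty coefficient toward a target KL with a PID loop."""

    def __init__(self, kl_target=0.05, kp=0.5, ki=0.01, kd=0.1,
                 beta_init=0.1, beta_min=1e-4, beta_max=10.0):
        self.kl_target = kl_target
        self.kp, self.ki, self.kd = kp, ki, kd
        self.beta = beta_init
        self.beta_min, self.beta_max = beta_min, beta_max
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, kl_observed, entropy, entropy_floor=1.0, reward_velocity=0.0):
        # PID step on the KL tracking error.
        error = kl_observed - self.kl_target
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        adjustment = self.kp * error + self.ki * self.integral + self.kd * derivative

        # Multiplicative update keeps the coefficient positive and bounded.
        self.beta = float(min(max(self.beta * (1.0 + adjustment),
                                  self.beta_min), self.beta_max))

        # Entropy gate: high entropy is treated as benign exploration and
        # penalized less; collapsing entropy keeps the full penalty.
        effective_beta = 0.5 * self.beta if entropy > entropy_floor else self.beta

        # Reward-velocity scaling: rapidly moving rewards suggest instability,
        # so the penalty is strengthened in proportion.
        return effective_beta * (1.0 + abs(reward_velocity))
```

In a training loop, one would presumably feed the measured per-update KL, the mean policy entropy, and a smoothed reward slope into `update()` to obtain that step's KL penalty coefficient; the actual schedule used by SAFE is described in the paper and repository.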