SAFE: Stable Alignment Finetuning with Entropy-Aware Predictive Control for RLHF
February 4, 2026
作者: Dipan Maity
cs.AI
Abstract
Proximal Policy Optimization (PPO) has been positioned by recent literature as the canonical method for the RL part of RLHF. PPO performs well empirically, but its heuristic motivation leads to an ad-hoc treatment of the KL-divergence constraint used in LM-RLHF, and it suffers from reward oscillations, entropy collapse, value function drift, and sudden policy divergence that require frequent restarts and extensive hyperparameter tuning. In this paper, we develop a new, purely on-policy actor-critic RL method for the LM-RLHF setting. We present SAFE (Stable Alignment Finetuning with Entropy-aware control), a novel RLHF algorithm that combines a Double Soft-Min Critic for pessimistic value estimation with a multi-layer stabilization framework built on entropy-gated KL regulation and PID-controlled adaptive thresholds. Unlike standard PPO's symmetric KL penalties, SAFE distinguishes high-entropy exploration from low-entropy mode collapse and adjusts penalties dynamically based on reward velocity. Experiments on a 3B-parameter model show that SAFE achieves a 5.15% higher training-average reward than PPO (0.725 vs 0.689), negligible reward crashes, and superior KL control compared to PPO. Our method adds minimal computational overhead and provides an interpretable, crash-resistant RLHF framework that maintains aggressive learning speed while ensuring stable long-horizon optimization suitable for production deployment. Code is available at https://github.com/ryyzn9/SAFE
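The sketch below is a minimal, hedged illustration of the mechanisms named in the abstract (a pessimistic double soft-min value estimate, and an entropy-gated KL penalty with a PID-controlled target scaled by reward velocity); it is not the authors' implementation, and all names and hyperparameters (SoftMinCritic, PIDKLController, tau, kp/ki/kd, entropy_floor) are illustrative assumptions.

```python
# Illustrative sketch only: assumed shapes of the ideas described in the abstract,
# not the SAFE codebase's API.
import torch
import torch.nn as nn


class SoftMinCritic(nn.Module):
    """Two independent value heads blended with a soft-min for pessimistic estimates."""

    def __init__(self, hidden_dim: int, tau: float = 1.0):
        super().__init__()
        self.v1 = nn.Linear(hidden_dim, 1)
        self.v2 = nn.Linear(hidden_dim, 1)
        self.tau = tau  # temperature: tau -> 0 recovers a hard min over the two heads

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        v = torch.cat([self.v1(h), self.v2(h)], dim=-1)           # (..., 2)
        # soft-min(v) = -tau * logsumexp(-v / tau): smooth, pessimistic combination
        return -self.tau * torch.logsumexp(-v / self.tau, dim=-1)


class PIDKLController:
    """PID loop on the KL deviation; the resulting penalty is gated by policy
    entropy and scaled by reward velocity (how quickly the reward is changing)."""

    def __init__(self, kl_target=0.05, kp=0.5, ki=0.01, kd=0.1,
                 entropy_floor=0.5, beta_init=0.1):
        self.kl_target, self.kp, self.ki, self.kd = kl_target, kp, ki, kd
        self.entropy_floor = entropy_floor
        self.beta = beta_init      # current KL penalty coefficient
        self.integral = 0.0
        self.prev_error = 0.0
        self.prev_reward = None

    def update(self, kl: float, entropy: float, mean_reward: float) -> float:
        # PID update on the deviation from the KL target
        error = kl - self.kl_target
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        self.beta = max(1e-4, self.beta + self.kp * error
                        + self.ki * self.integral + self.kd * derivative)

        # Entropy gate: a low-entropy policy (mode-collapse risk) is penalized
        # at full strength; a high-entropy, still-exploring policy is relaxed.
        gate = 1.0 if entropy < self.entropy_floor else 0.5

        # Reward velocity: fast reward swings increase the effective penalty.
        velocity = 0.0 if self.prev_reward is None else abs(mean_reward - self.prev_reward)
        self.prev_reward = mean_reward

        return self.beta * gate * (1.0 + velocity)


# Usage sketch: fold the adaptive coefficient into the policy loss each step,
# e.g. loss = policy_loss + beta * kl inside the RLHF training loop.
ctrl = PIDKLController()
beta = ctrl.update(kl=0.08, entropy=0.4, mean_reward=0.71)
print(f"adaptive KL coefficient: {beta:.4f}")
```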