

Safe RLHF: Safe Reinforcement Learning from Human Feedback

October 19, 2023
Authors: Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang
cs.AI

Abstract

With the development of large language models (LLMs), striking a balance between the performance and safety of AI systems has never been more critical. However, the inherent tension between the objectives of helpfulness and harmlessness presents a significant challenge during LLM training. To address this issue, we propose Safe Reinforcement Learning from Human Feedback (Safe RLHF), a novel algorithm for human value alignment. Safe RLHF explicitly decouples human preferences regarding helpfulness and harmlessness, effectively avoiding crowdworkers' confusion about this tension and allowing us to train separate reward and cost models. We formalize the safety concern of LLMs as an optimization task: maximizing the reward function while satisfying specified cost constraints. Leveraging the Lagrangian method to solve this constrained problem, Safe RLHF dynamically adjusts the balance between the two objectives during fine-tuning. Through three rounds of fine-tuning with Safe RLHF, we demonstrate a superior ability to mitigate harmful responses while enhancing model performance compared to existing value-alignment algorithms. Experimentally, we fine-tuned Alpaca-7B using Safe RLHF and aligned it with the collected human preferences, significantly improving its helpfulness and harmlessness according to human evaluations.
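The abstract describes a constrained objective (maximize the reward model's expected score while keeping the cost model's expected score below a threshold) solved with the Lagrangian method, whose multiplier is adjusted dynamically during fine-tuning. The sketch below illustrates that idea only; it is not the authors' released code, and the function names, cost threshold, KL coefficient, and learning rate are hypothetical stand-ins.

```python
import torch

def safe_rlhf_objective(reward, cost, log_ratio, lam, kl_coef=0.1):
    """Lagrangian-weighted training signal: reward minus lam-scaled cost,
    with a KL-style penalty toward the reference policy."""
    shaped = reward - lam * cost - kl_coef * log_ratio
    return shaped.mean()

def update_lambda(lam, cost, threshold=0.0, lr=0.05):
    """Dual ascent on the multiplier: grow lam when the average cost
    exceeds the threshold, shrink it otherwise, clipped at zero."""
    violation = cost.mean().item() - threshold
    return max(0.0, lam + lr * violation)

# Toy usage with random stand-in scores for a batch of eight responses.
reward = torch.randn(8)     # reward-model scores (helpfulness)
cost = torch.randn(8)       # cost-model scores (positive = harmful)
log_ratio = torch.zeros(8)  # log pi_theta / pi_ref per response
lam = 1.0
objective = safe_rlhf_objective(reward, cost, log_ratio, lam)
lam = update_lambda(lam, cost)
print(f"objective={objective.item():.3f}  lambda={lam:.3f}")
```

In the paper's setting the reward and cost values would come from the two separately trained preference models; here random tensors stand in for them, and the alternating policy/multiplier updates are reduced to a single step for clarity.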
