
Safe RLHF: Safe Reinforcement Learning from Human Feedback

October 19, 2023
Authors: Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang
cs.AI

Abstract

With the development of large language models (LLMs), striking a balance between the performance and safety of AI systems has never been more critical. However, the inherent tension between the objectives of helpfulness and harmlessness presents a significant challenge during LLM training. To address this issue, we propose Safe Reinforcement Learning from Human Feedback (Safe RLHF), a novel algorithm for human value alignment. Safe RLHF explicitly decouples human preferences regarding helpfulness and harmlessness, effectively avoiding the crowdworkers' confusion about the tension and allowing us to train separate reward and cost models. We formalize the safety concern of LLMs as an optimization task of maximizing the reward function while satisfying specified cost constraints. Leveraging the Lagrangian method to solve this constrained problem, Safe RLHF dynamically adjusts the balance between the two objectives during fine-tuning. Through a three-round fine-tuning using Safe RLHF, we demonstrate a superior ability to mitigate harmful responses while enhancing model performance compared to existing value-aligned algorithms. Experimentally, we fine-tuned the Alpaca-7B using Safe RLHF and aligned it with collected human preferences, significantly improving its helpfulness and harmlessness according to human evaluations.
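As a rough illustration of the formulation described in the abstract, the constrained objective and its Lagrangian relaxation can be sketched as follows. This is a schematic rendering only: the symbols R_phi (reward model), C_psi (cost model), pi_theta (the fine-tuned LLM policy), the prompt distribution D, and the zero cost threshold are assumed notation, not necessarily the paper's own.

```latex
% Schematic sketch of the constrained objective described in the abstract (assumed notation):
% maximize expected reward subject to an expected-cost constraint.
\begin{aligned}
\max_{\theta}\quad & \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\!\left[ R_\phi(x, y) \right] \\
\text{s.t.}\quad   & \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\!\left[ C_\psi(x, y) \right] \le 0 .
\end{aligned}

% Lagrangian relaxation, solved as a min-max problem; the multiplier \lambda \ge 0 is
% updated during fine-tuning, which is what allows the reward/safety balance to shift dynamically.
\min_{\lambda \ge 0}\; \max_{\theta}\;
\mathbb{E}\!\left[ R_\phi(x, y) \right] \;-\; \lambda\, \mathbb{E}\!\left[ C_\psi(x, y) \right]
```

Under this reading, whenever the expected cost exceeds the threshold the multiplier grows, tilting the training signal toward harmlessness; when the constraint is satisfied it shrinks, letting optimization emphasize helpfulness.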