Safe RLHF: Safe Reinforcement Learning from Human Feedback

Abstract

With the development of large language models (LLMs), striking a balance between the performance and safety of AI systems has never been more critical. However, the inherent tension between the objectives of helpfulness and harmlessness presents a significant challenge during LLM training. To address this issue, we propose Safe Reinforcement Learning from Human Feedback (Safe RLHF), a novel algorithm for human value alignment. Safe RLHF explicitly decouples human preferences regarding helpfulness and harmlessness, so that crowdworkers are not confused by the tension between the two, and allows us to train separate reward and cost models. We formalize the safety concern of LLMs as an optimization task: maximize the reward function while satisfying specified cost constraints. Leveraging the Lagrangian method to solve this constrained problem, Safe RLHF dynamically adjusts the balance between the two objectives during fine-tuning. Through three rounds of fine-tuning with Safe RLHF, we demonstrate a superior ability to mitigate harmful responses while enhancing model performance compared to existing value-alignment algorithms. Experimentally, we fine-tuned Alpaca-7B using Safe RLHF and aligned it with collected human preferences, significantly improving its helpfulness and harmlessness according to human evaluations.
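As an illustration of the Lagrangian idea described in the abstract, the following is a minimal toy sketch, not the paper's implementation. It assumes a one-dimensional parameter `theta`, a hypothetical reward R(theta) = -(theta - 2)^2, and a hypothetical cost C(theta) = theta - 1 with the constraint C(theta) <= 0. All function names, step sizes, and the toy objective are assumptions made for this example. The multiplier `lam` grows while the constraint is violated and shrinks toward zero otherwise, which is one simple way the balance between the two objectives can be adjusted dynamically.

```python
def reward_grad(theta):
    # Gradient of the hypothetical toy reward R(theta) = -(theta - 2)^2.
    return -2.0 * (theta - 2.0)


def cost(theta):
    # Hypothetical toy cost C(theta) = theta - 1; the constraint is C(theta) <= 0.
    return theta - 1.0


def cost_grad(theta):
    # Gradient of the toy cost with respect to theta.
    return 1.0


def solve(steps=2000, lr_theta=0.05, lr_lam=0.1):
    """Gradient ascent on theta and projected ascent on the multiplier lam."""
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        # Primal step: ascend the Lagrangian R(theta) - lam * C(theta).
        theta += lr_theta * (reward_grad(theta) - lam * cost_grad(theta))
        # Dual step: raise lam when the constraint is violated, keep lam >= 0.
        lam = max(0.0, lam + lr_lam * cost(theta))
    return theta, lam


if __name__ == "__main__":
    theta, lam = solve()
    print(f"theta={theta:.4f}, lam={lam:.4f}")
```

In this toy problem the unconstrained optimum is theta = 2, which violates the constraint, so the iterates settle near the constrained optimum theta = 1 with a multiplier near 2. In an LLM setting the reward and cost would come from learned models and the policy update would be a reinforcement learning step rather than an exact gradient; those parts are outside this sketch.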
