SALSA: Soup-based Alignment Learning for Stronger Adaptation in RLHF

Abstract


In large language model (LLM) development, Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning models with human values and preferences. RLHF traditionally uses a frozen initial policy as a reference: the Kullback-Leibler (KL) divergence between the current policy and this reference is added as a penalty in policy optimization algorithms such as Proximal Policy Optimization (PPO). While this constraint prevents the model from deviating too far from the initial checkpoint, it also limits exploration of the reward landscape and reduces the model's ability to discover higher-quality solutions. As a result, policy optimization is often trapped in a narrow region of the parameter space, leading to suboptimal alignment and performance. This paper presents SALSA (Soup-based Alignment Learning for Stronger Adaptation), a novel approach designed to overcome these limitations by creating a more flexible and better-located reference model through weight-space averaging of two independent supervised fine-tuned (SFT) models. This model soup permits larger deviations in KL divergence and lets the policy explore a promising region of the solution space without sacrificing stability. By leveraging this more robust reference model, SALSA fosters better exploration, achieves higher rewards, and improves model robustness, out-of-distribution generalization, and performance. We validate the effectiveness of SALSA through extensive experiments on popular open models (Llama2-7B, Mistral-7B, and Gemma-2B) across various benchmarks (MT-Bench, Arena-Hard, UltraFeedback), where it consistently surpasses PPO by fostering deeper exploration and achieving superior alignment in LLMs.
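The two ingredients named in the abstract, a weight-averaged reference model and a KL penalty measured against it, can be illustrated with a minimal sketch. Everything below is an assumption for illustration rather than the authors' implementation: the names `make_soup` and `kl_penalized_reward`, the mixing weight `alpha` (the abstract does not state the averaging coefficients; 0.5 gives a uniform average), the penalty coefficient `beta`, and the use of a per-sample log-ratio as the KL estimate. Model weights are represented as plain dictionaries of float lists so the example runs without any deep learning library.

```python
from typing import Dict, List

# Hypothetical stand-in for a model checkpoint: parameter name -> flat weights.
Weights = Dict[str, List[float]]


def make_soup(sft_a: Weights, sft_b: Weights, alpha: float = 0.5) -> Weights:
    """Average two SFT checkpoints in weight space.

    alpha is an assumed mixing weight; alpha=0.5 is a uniform average.
    """
    if sft_a.keys() != sft_b.keys():
        raise ValueError("checkpoints must have the same parameter names")
    soup: Weights = {}
    for name in sft_a:
        wa, wb = sft_a[name], sft_b[name]
        if len(wa) != len(wb):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise interpolation between the two checkpoints.
        soup[name] = [alpha * x + (1.0 - alpha) * y for x, y in zip(wa, wb)]
    return soup


def kl_penalized_reward(
    reward: float, logp_policy: float, logp_ref: float, beta: float = 0.1
) -> float:
    """Reward minus a KL penalty, using the per-sample log-ratio
    log pi(y|x) - log pi_ref(y|x) as the KL estimate (an assumption here).

    In a soup-based setup, logp_ref would come from the averaged model
    instead of the single frozen initial policy.
    """
    return reward - beta * (logp_policy - logp_ref)


if __name__ == "__main__":
    sft_a = {"layer.weight": [1.0, 3.0]}
    sft_b = {"layer.weight": [3.0, 5.0]}
    print(make_soup(sft_a, sft_b))
    print(kl_penalized_reward(1.0, logp_policy=-1.0, logp_ref=-2.0, beta=0.5))
```

With a real framework, the same interpolation would be applied tensor by tensor to the two models' parameter dictionaries, and the resulting reference would be kept frozen during policy optimization.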
