
WARP: On the Benefits of Weight Averaged Rewarded Policies

June 24, 2024
作者: Alexandre Ramé, Johan Ferret, Nino Vieillard, Robert Dadashi, Léonard Hussenot, Pierre-Louis Cedoz, Pier Giuseppe Sessa, Sertan Girgin, Arthur Douillard, Olivier Bachem
cs.AI

Abstract

Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs) by encouraging their generations to have high rewards, using a reward model trained on human preferences. To prevent the forgetting of pre-trained knowledge, RLHF usually incorporates a KL regularization; this forces the policy to remain close to its supervised fine-tuned initialization, though it hinders the reward optimization. To tackle the trade-off between KL and reward, in this paper we introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP). WARP merges policies in the weight space at three distinct stages. First, it uses the exponential moving average of the policy as a dynamic anchor in the KL regularization. Second, it applies spherical interpolation to merge independently fine-tuned policies into a new enhanced one. Third, it linearly interpolates between this merged model and the initialization, to recover features from pre-training. This procedure is then applied iteratively, with each iteration's final model used as an advanced initialization for the next, progressively refining the KL-reward Pareto front, achieving superior rewards at fixed KL. Experiments with GEMMA policies validate that WARP improves their quality and alignment, outperforming other open-source LLMs.
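The three merging stages described above can be sketched on flat weight vectors. This is a minimal illustration, not the paper's implementation: the function names and the `decay`, `t`, and `eta` coefficients are illustrative assumptions, and WARP itself applies these operations to full LLM parameters inside an RLHF training loop.

```python
import numpy as np

def ema_update(anchor, policy, decay=0.99):
    # Stage 1: exponential moving average of the policy weights,
    # used as a dynamic anchor for the KL regularization.
    return decay * anchor + (1 - decay) * policy

def slerp(w1, w2, t=0.5):
    # Stage 2: spherical interpolation between two independently
    # fine-tuned policies' weights.
    cos_omega = np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.isclose(omega, 0.0):
        # Nearly parallel weights: fall back to linear interpolation.
        return (1 - t) * w1 + t * w2
    return (np.sin((1 - t) * omega) * w1 + np.sin(t * omega) * w2) / np.sin(omega)

def lerp_to_init(merged, init, eta=0.3):
    # Stage 3: linear interpolation back toward the initialization
    # to recover pre-trained features.
    return (1 - eta) * merged + eta * init
```

In the iterative procedure, the output of `lerp_to_init` would serve as the initialization for the next round of fine-tuning, progressively pushing the KL-reward Pareto front.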

