
WARP: On the Benefits of Weight Averaged Rewarded Policies

June 24, 2024
Authors: Alexandre Ramé, Johan Ferret, Nino Vieillard, Robert Dadashi, Léonard Hussenot, Pierre-Louis Cedoz, Pier Giuseppe Sessa, Sertan Girgin, Arthur Douillard, Olivier Bachem
cs.AI

Abstract

Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs) by encouraging their generations to have high rewards, using a reward model trained on human preferences. To prevent the forgetting of pre-trained knowledge, RLHF usually incorporates a KL regularization; this forces the policy to remain close to its supervised fine-tuned initialization, though it hinders the reward optimization. To tackle the trade-off between KL and reward, in this paper we introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP). WARP merges policies in the weight space at three distinct stages. First, it uses the exponential moving average of the policy as a dynamic anchor in the KL regularization. Second, it applies spherical interpolation to merge independently fine-tuned policies into a new enhanced one. Third, it linearly interpolates between this merged model and the initialization, to recover features from pre-training. This procedure is then applied iteratively, with each iteration's final model used as an advanced initialization for the next, progressively refining the KL-reward Pareto front, achieving superior rewards at fixed KL. Experiments with GEMMA policies validate that WARP improves their quality and alignment, outperforming other open-source LLMs.
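The three weight-space merges described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it treats a policy as a flat list of float parameters (a hypothetical simplification of real LLM weight tensors), and the function names, default coefficients, and the exact interpolation conventions are assumptions for illustration only.

```python
import math

def ema_update(anchor, policy, mu=0.01):
    """Stage 1: move the exponential-moving-average anchor toward the
    current policy; the anchor is the dynamic target of the KL term."""
    return [(1.0 - mu) * a + mu * p for a, p in zip(anchor, policy)]

def slerp(w1, w2, lam=0.5):
    """Stage 2: spherical interpolation between two independently
    fine-tuned policies (lam=0.5 merges them symmetrically)."""
    dot = sum(a * b for a, b in zip(w1, w2))
    n1 = math.sqrt(sum(a * a for a in w1))
    n2 = math.sqrt(sum(b * b for b in w2))
    cos_omega = max(-1.0, min(1.0, dot / (n1 * n2)))
    omega = math.acos(cos_omega)
    if omega < 1e-8:
        # Nearly parallel weights: fall back to plain linear interpolation.
        return [(1.0 - lam) * a + lam * b for a, b in zip(w1, w2)]
    s = math.sin(omega)
    c1 = math.sin((1.0 - lam) * omega) / s
    c2 = math.sin(lam * omega) / s
    return [c1 * a + c2 * b for a, b in zip(w1, w2)]

def liti(init, merged, eta=0.3):
    """Stage 3: linear interpolation between the initialization and the
    merged model, recovering pre-trained features (eta=1.0 would keep
    the merged model unchanged)."""
    return [(1.0 - eta) * i + eta * m for i, m in zip(init, merged)]
```

In the iterative procedure, the output of `liti` would then serve as the initialization for the next round of fine-tuning, progressively pushing out the KL-reward Pareto front.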

