
WARM: On the Benefits of Weight Averaged Reward Models

January 22, 2024
作者: Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, Johan Ferret
cs.AI

Abstract

Aligning large language models (LLMs) with human preferences through reinforcement learning (RLHF) can lead to reward hacking, where LLMs exploit failures in the reward model (RM) to achieve seemingly high rewards without meeting the underlying objectives. We identify two primary challenges when designing RMs to mitigate reward hacking: distribution shifts during the RL process and inconsistencies in human preferences. As a solution, we propose Weight Averaged Reward Models (WARM), first fine-tuning multiple RMs, then averaging them in the weight space. This strategy follows the observation that fine-tuned weights remain linearly mode connected when sharing the same pre-training. By averaging weights, WARM improves efficiency compared to the traditional ensembling of predictions, while improving reliability under distribution shifts and robustness to preference inconsistencies. Our experiments on summarization tasks, using best-of-N and RL methods, show that WARM improves the overall quality and alignment of LLM predictions; for example, a policy RL fine-tuned with WARM has a 79.4% win rate against a policy RL fine-tuned with a single RM.
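Concretely, the recipe described in the abstract is: fine-tune several reward models from the same pre-trained checkpoint (e.g., with different seeds or data orders), then average their parameters element-wise and use the result as a single RM for best-of-N selection or RL fine-tuning. Below is a minimal sketch of the weight-averaging step, assuming PyTorch-style models; `RewardModel` and `average_weights` are illustrative names, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): uniform weight averaging of M reward
# models fine-tuned from a shared pre-trained initialization, so their weights
# are assumed to be linearly mode connected.
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Placeholder reward model: maps a feature vector to a scalar reward."""

    def __init__(self, dim: int = 16):
        super().__init__()
        self.backbone = nn.Linear(dim, dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(torch.relu(self.backbone(x))).squeeze(-1)


def average_weights(models: list[nn.Module]) -> nn.Module:
    """Average the (floating-point) parameters of fine-tuned models in weight space."""
    averaged = RewardModel()
    avg_state = averaged.state_dict()
    states = [m.state_dict() for m in models]
    for key in avg_state:
        # Element-wise mean across the M checkpoints for this parameter tensor.
        avg_state[key] = torch.stack([s[key] for s in states]).mean(dim=0)
    averaged.load_state_dict(avg_state)
    return averaged


# Usage: pretend these are M RMs fine-tuned with different seeds / data orders.
rms = [RewardModel() for _ in range(3)]
warm_rm = average_weights(rms)
rewards = warm_rm(torch.randn(4, 16))  # one scalar reward per candidate
```

Unlike a prediction ensemble, the averaged model is a single network, so scoring candidates costs one forward pass instead of M.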