WARM: On the Benefits of Weight Averaged Reward Models
January 22, 2024
Authors: Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, Johan Ferret
cs.AI
Abstract
Aligning large language models (LLMs) with human preferences through
reinforcement learning (RLHF) can lead to reward hacking, where LLMs exploit
failures in the reward model (RM) to achieve seemingly high rewards without
meeting the underlying objectives. We identify two primary challenges when
designing RMs to mitigate reward hacking: distribution shifts during the RL
process and inconsistencies in human preferences. As a solution, we propose
Weight Averaged Reward Models (WARM), first fine-tuning multiple RMs, then
averaging them in the weight space. This strategy follows the observation that
fine-tuned weights remain linearly mode connected when sharing the same
pre-training. By averaging weights, WARM improves efficiency compared to the
traditional ensembling of predictions, while improving reliability under
distribution shifts and robustness to preference inconsistencies. Our
experiments on summarization tasks, using best-of-N and RL methods, show that
WARM improves the overall quality and alignment of LLM predictions; for
example, a policy RL fine-tuned with WARM has a 79.4% win rate against a policy
RL fine-tuned with a single RM.
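
As a rough illustration of the recipe described in the abstract, the sketch below uniformly averages the weights of several reward models fine-tuned from the same pre-trained checkpoint, then uses the resulting single model for best-of-N selection. The checkpoint paths, the `RewardModel` class, and the `score` function are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of the WARM recipe, assuming M reward models were fine-tuned
# from the SAME pre-trained checkpoint, so their weights remain linearly mode
# connected and can be meaningfully averaged.
import torch

def average_reward_models(state_dicts):
    """Uniformly average the weights of several fine-tuned reward models."""
    avg = {}
    for key in state_dicts[0]:
        avg[key] = torch.mean(
            torch.stack([sd[key].float() for sd in state_dicts]), dim=0
        )
    return avg

# Load the individually fine-tuned RM checkpoints (paths are placeholders).
paths = [f"reward_model_{i}.pt" for i in range(3)]
state_dicts = [torch.load(p, map_location="cpu") for p in paths]

warm_state_dict = average_reward_models(state_dicts)
# reward_model = RewardModel(...)              # hypothetical model class
# reward_model.load_state_dict(warm_state_dict)

# Best-of-N with the averaged RM: score N candidate summaries for one prompt
# and keep the highest-scoring one. `score(prompt, candidate)` is assumed to
# return the scalar reward of the single averaged model.
def best_of_n(prompt, candidates, score):
    return max(candidates, key=lambda c: score(prompt, c))
```

Averaging in weight space yields one model to serve at inference, which is why the abstract contrasts WARM's efficiency with prediction ensembling, where every member RM must be evaluated per candidate.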