WARM: On the Benefits of Weight Averaged Reward Models
January 22, 2024
Authors: Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, Johan Ferret
cs.AI
Abstract
Aligning large language models (LLMs) with human preferences through
reinforcement learning (RLHF) can lead to reward hacking, where LLMs exploit
failures in the reward model (RM) to achieve seemingly high rewards without
meeting the underlying objectives. We identify two primary challenges when
designing RMs to mitigate reward hacking: distribution shifts during the RL
process and inconsistencies in human preferences. As a solution, we propose
Weight Averaged Reward Models (WARM), first fine-tuning multiple RMs, then
averaging them in the weight space. This strategy follows the observation that
fine-tuned weights remain linearly mode connected when sharing the same
pre-training. By averaging weights, WARM improves efficiency compared to the
traditional ensembling of predictions, while improving reliability under
distribution shifts and robustness to preference inconsistencies. Our
experiments on summarization tasks, using best-of-N and RL methods, show that
WARM improves the overall quality and alignment of LLM predictions; for
example, a policy RL fine-tuned with WARM has a 79.4% win rate against a policy
RL fine-tuned with a single RM.
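
As a rough illustration of the recipe described in the abstract, the sketch below uniformly averages the weights of several reward models fine-tuned from the same pre-trained checkpoint, then uses the resulting single model for best-of-N selection. The checkpoint paths, the `RewardModel` class, and the `score` function are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of the WARM recipe, assuming M reward models were fine-tuned
# from the SAME pre-trained checkpoint, so their weights remain linearly mode
# connected and can be meaningfully averaged.
import torch

def average_reward_models(state_dicts):
    """Uniformly average the weights of several fine-tuned reward models."""
    avg = {}
    for key in state_dicts[0]:
        avg[key] = torch.mean(
            torch.stack([sd[key].float() for sd in state_dicts]), dim=0
        )
    return avg

# Load the individually fine-tuned RM checkpoints (paths are placeholders).
paths = [f"reward_model_{i}.pt" for i in range(3)]
state_dicts = [torch.load(p, map_location="cpu") for p in paths]

warm_state_dict = average_reward_models(state_dicts)
# reward_model = RewardModel(...)              # hypothetical model class
# reward_model.load_state_dict(warm_state_dict)

# Best-of-N with the averaged RM: score N candidate summaries for one prompt
# and keep the highest-scoring one. `score(prompt, candidate)` is assumed to
# return the scalar reward of the single averaged model.
def best_of_n(prompt, candidates, score):
    return max(candidates, key=lambda c: score(prompt, c))
```

Averaging in weight space yields one model to serve at inference, which is why the abstract contrasts WARM's efficiency with prediction ensembling, where every member RM must be evaluated per candidate.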