WARM: On the Benefits of Weight Averaged Reward Models

Abstract

Aligning large language models (LLMs) with human preferences through reinforcement learning (RLHF) can lead to reward hacking, where LLMs exploit failures in the reward model (RM) to achieve seemingly high rewards without meeting the underlying objectives. We identify two primary challenges when designing RMs to mitigate reward hacking: distribution shifts during the RL process and inconsistencies in human preferences. As a solution, we propose Weight Averaged Reward Models (WARM), which first fine-tunes multiple RMs and then averages them in weight space. This strategy follows the observation that fine-tuned weights remain linearly mode connected when they share the same pre-training. By averaging weights, WARM is more efficient than the traditional ensembling of predictions, while improving reliability under distribution shifts and robustness to preference inconsistencies. Our experiments on summarization tasks, using best-of-N and RL methods, show that WARM improves the overall quality and alignment of LLM predictions; for example, a policy RL fine-tuned with WARM has a 79.4% win rate against a policy RL fine-tuned with a single RM.
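The weight-averaging step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Weights` format, the helper names `average_weights`, `reward`, and `best_of_n`, and the toy linear reward head are all assumptions made for illustration, and the averaging is assumed to be uniform. For a linear head, scoring with averaged weights coincides with averaging the individual predictions; for deep networks the two generally differ, which is where the linear mode connectivity observation matters.

```python
from typing import Dict, List

# Hypothetical checkpoint format: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def average_weights(models: List[Weights]) -> Weights:
    """Uniformly average the parameters of models sharing one architecture."""
    if not models:
        raise ValueError("need at least one model")
    averaged: Weights = {}
    for name in models[0]:
        # Group the i-th value of this parameter across all models.
        columns = zip(*(m[name] for m in models))
        averaged[name] = [sum(col) / len(models) for col in columns]
    return averaged


def reward(weights: Weights, features: List[float]) -> float:
    """Toy linear reward head (an assumption): dot(w, features) + b."""
    return sum(w * x for w, x in zip(weights["w"], features)) + weights["b"][0]


def best_of_n(weights: Weights, candidates: List[List[float]]) -> List[float]:
    """Return the candidate that the given reward model scores highest."""
    return max(candidates, key=lambda feats: reward(weights, feats))


# Two toy "fine-tuned" reward models with made-up parameters.
rm_a: Weights = {"w": [1.0, 0.0], "b": [0.0]}
rm_b: Weights = {"w": [0.0, 1.0], "b": [1.0]}

# One merged model: a single set of weights, so a single forward pass.
warm = average_weights([rm_a, rm_b])

# Prediction ensembling needs one forward pass per member.
features = [2.0, 4.0]
ensemble_score = (reward(rm_a, features) + reward(rm_b, features)) / 2
warm_score = reward(warm, features)

candidates = [[2.0, 0.0], [0.0, 1.0], [1.0, 3.0]]
picked_by_warm = best_of_n(warm, candidates)
picked_by_rm_a = best_of_n(rm_a, candidates)
```

In this toy setup the merged model and the single model `rm_a` select different candidates under best-of-N, while the merged model costs the same to evaluate as any one member.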

