Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking

Abstract

Reward models play a key role in aligning language model applications with human preferences. However, this setup creates an incentive for the language model to exploit errors in the reward model to achieve high estimated reward, a phenomenon often termed reward hacking. A natural mitigation is to train an ensemble of reward models and aggregate their outputs to obtain a more robust reward estimate. We explore the application of reward ensembles to alignment at both training time (through reinforcement learning) and inference time (through reranking). First, we show that reward models are underspecified: reward models that perform similarly in-distribution can yield very different rewards when used in alignment, due to distribution shift. Second, underspecification results in overoptimization, where alignment to one reward model does not improve reward as measured by another reward model trained on the same data. Third, overoptimization is mitigated by the use of reward ensembles, and ensembles whose members differ in their pretraining seeds generalize better than ensembles whose members differ only in their fine-tuning seeds, with both outperforming individual reward models. However, even pretrain reward ensembles do not eliminate reward hacking: we show several qualitative reward hacking phenomena that are not mitigated by ensembling because all reward models in the ensemble exhibit similar error patterns.
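The inference-time use of a reward ensemble described above (reranking candidates by an aggregated reward) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `ensemble_reward`, `rerank_best_of_n`, `rm_verbose`, and `rm_keyword` are hypothetical, the toy reward models stand in for trained models, and the aggregation functions (mean and min) are assumed choices that the abstract does not specify.

```python
from statistics import mean
from typing import Callable, Sequence

# Assumption: a reward model is any callable mapping (prompt, response) to a scalar.
RewardModel = Callable[[str, str], float]


def ensemble_reward(
    models: Sequence[RewardModel],
    prompt: str,
    response: str,
    aggregate: Callable[[Sequence[float]], float] = mean,
) -> float:
    """Score one response with every ensemble member, then aggregate the scores."""
    scores = [model(prompt, response) for model in models]
    return aggregate(scores)


def rerank_best_of_n(
    models: Sequence[RewardModel],
    prompt: str,
    candidates: Sequence[str],
    aggregate: Callable[[Sequence[float]], float] = mean,
) -> str:
    """Return the candidate with the highest aggregated ensemble reward."""
    return max(
        candidates,
        key=lambda c: ensemble_reward(models, prompt, c, aggregate),
    )


# Hypothetical toy reward models (illustrative stand-ins, not trained models).
def rm_verbose(prompt: str, response: str) -> float:
    # Flawed on purpose: rewards length, an error a policy could exploit.
    return 0.1 * len(response.split())


def rm_keyword(prompt: str, response: str) -> float:
    # Rewards responses that contain the expected answer.
    return 1.0 if "Paris" in response else 0.0


prompt = "What is the capital of France?"
candidates = [
    "Paris.",
    "It is a lovely city with many many sights and cafes and museums to see.",
]

# A single flawed reward model prefers the long, unhelpful response.
single_choice = rerank_best_of_n([rm_verbose], prompt, candidates)

# A conservative (min) aggregate over two models that disagree prefers the short answer.
ensemble_choice = rerank_best_of_n(
    [rm_verbose, rm_keyword], prompt, candidates, aggregate=min
)
```

In this toy setup the ensemble helps only because its two members make different errors. If every member rewarded verbosity, any aggregate would still select the long response, which mirrors the abstract's caveat that ensembling cannot fix error patterns shared by all members.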
