Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking
December 14, 2023
Authors: Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alex D'Amour, DJ Dvijotham, Adam Fisch, Katherine Heller, Stephen Pfohl, Deepak Ramachandran, Peter Shaw, Jonathan Berant
cs.AI
Abstract
Reward models play a key role in aligning language model applications towards
human preferences. However, this setup creates an incentive for the language
model to exploit errors in the reward model to achieve high estimated reward, a
phenomenon often termed reward hacking. A natural mitigation is to train
an ensemble of reward models, aggregating over model outputs to obtain a more
robust reward estimate. We explore the application of reward ensembles to
alignment at both training time (through reinforcement learning) and inference
time (through reranking). First, we show that reward models are
underspecified: reward models that perform similarly in-distribution can
yield very different rewards when used in alignment, due to distribution shift.
Second, underspecification results in overoptimization, where alignment to one
reward model does not improve reward as measured by another reward model
trained on the same data. Third, overoptimization is mitigated by the use of
reward ensembles, and ensembles that vary by their pretraining seeds
lead to better generalization than ensembles that differ only by their
fine-tuning seeds, with both outperforming individual reward models.
However, even ensembles that vary by pretraining seed do not eliminate reward hacking: we
show several qualitative reward hacking phenomena that are not mitigated by
ensembling because all reward models in the ensemble exhibit similar error
patterns.
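
The aggregation and reranking ideas described above can be illustrated with a minimal sketch. Everything named here is hypothetical and chosen for illustration only: the function names `ensemble_reward` and `rerank`, the `how` options, and the two toy reward models are assumptions, not the authors' implementation. The toy model `rm_length` has a deliberate error (it rewards longer responses), which a single-model reranker exploits, while a two-model ensemble does not.

```python
from statistics import mean, median
from typing import Callable, Sequence

# Hypothetical signature for a reward model: (prompt, response) -> score.
RewardFn = Callable[[str, str], float]


def ensemble_reward(scores: Sequence[float], how: str = "mean") -> float:
    """Aggregate per-model scores into one reward estimate."""
    if how == "mean":
        return mean(scores)
    if how == "median":
        return median(scores)
    if how == "min":  # conservative: trust the most pessimistic model
        return min(scores)
    raise ValueError(f"unknown aggregation: {how}")


def rerank(prompt: str, candidates: Sequence[str],
           reward_models: Sequence[RewardFn], how: str = "mean") -> str:
    """Inference-time alignment: return the candidate with the highest aggregated reward."""
    return max(
        candidates,
        key=lambda c: ensemble_reward([rm(prompt, c) for rm in reward_models], how),
    )


# Toy reward models (assumptions for illustration, not learned models).
def rm_length(prompt: str, response: str) -> float:
    # Flawed model: score grows with the number of words.
    return 0.1 * len(response.split())


def rm_keyword(prompt: str, response: str) -> float:
    # Model with a different bias: rewards mentioning the expected answer.
    return 1.0 if "paris" in response.lower() else 0.0


PROMPT = "What is the capital of France?"
SHORT = "Paris."
LONG = "I think it is maybe possibly London or something else"
CANDIDATES = [SHORT, LONG]

single_choice = rerank(PROMPT, CANDIDATES, [rm_length])
ensemble_choice = rerank(PROMPT, CANDIDATES, [rm_length, rm_keyword])
```

In this toy setting the single flawed model selects the long, wrong response, while the ensemble selects the short, correct one because the two models err differently. If every model in the ensemble shared the length bias, aggregation would not help, which mirrors the abstract's point that ensembling fails when all members exhibit similar error patterns.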