Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking
December 14, 2023
Authors: Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alex D'Amour, DJ Dvijotham, Adam Fisch, Katherine Heller, Stephen Pfohl, Deepak Ramachandran, Peter Shaw, Jonathan Berant
cs.AI
Abstract
Reward models play a key role in aligning language model applications with
human preferences. However, this setup creates an incentive for the language
model to exploit errors in the reward model to achieve high estimated reward, a
phenomenon often termed reward hacking. A natural mitigation is to train
an ensemble of reward models, aggregating over model outputs to obtain a more
robust reward estimate. We explore the application of reward ensembles to
alignment at both training time (through reinforcement learning) and inference
time (through reranking). First, we show that reward models are
underspecified: reward models that perform similarly in-distribution can
yield very different rewards when used in alignment, due to distribution shift.
Second, underspecification results in overoptimization, where alignment to one
reward model does not improve reward as measured by another reward model
trained on the same data. Third, overoptimization is mitigated by the use of
reward ensembles, and ensembles that vary by their pretraining seeds
lead to better generalization than ensembles that differ only by their
fine-tuning seeds, with both outperforming individual reward models.
However, even ensembles that vary by their pretraining seeds do not eliminate reward hacking: we
show several qualitative reward hacking phenomena that are not mitigated by
ensembling because all reward models in the ensemble exhibit similar error
patterns.
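
To make the inference-time setting concrete, the sketch below reranks candidate completions with an ensemble of reward models, aggregating per-model scores into a single robust estimate. This is a minimal illustration, not the paper's implementation: the names (`ensemble_reward`, `rerank_best_of_n`, `reward_models`) and the mean/min aggregation choices are assumptions made for the example.

```python
# Minimal sketch (hypothetical, not the paper's code): best-of-n reranking with a
# reward-model ensemble. `reward_models` is assumed to be a sequence of callables
# mapping (prompt, response) -> scalar reward.
from typing import Callable, List, Sequence

RewardModel = Callable[[str, str], float]

def ensemble_reward(prompt: str, response: str,
                    reward_models: Sequence[RewardModel],
                    aggregate: str = "mean") -> float:
    """Aggregate the ensemble's rewards into a single estimate."""
    scores = [rm(prompt, response) for rm in reward_models]
    if aggregate == "mean":
        return sum(scores) / len(scores)
    if aggregate == "min":  # pessimistic: penalize responses the members disagree on
        return min(scores)
    raise ValueError(f"unknown aggregation: {aggregate}")

def rerank_best_of_n(prompt: str, candidates: List[str],
                     reward_models: Sequence[RewardModel],
                     aggregate: str = "mean") -> str:
    """Return the candidate with the highest aggregated ensemble reward."""
    return max(candidates,
               key=lambda c: ensemble_reward(prompt, c, reward_models, aggregate))
```

The same aggregated score could in principle serve as the reward signal during RL-based alignment (the training-time setting in the abstract); the specific aggregation rule used in the paper is not restated here.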