RRM: Robust Reward Model Training Mitigates Reward Hacking
September 20, 2024
Authors: Tianqi Liu, Wei Xiong, Jie Ren, Lichang Chen, Junru Wu, Rishabh Joshi, Yang Gao, Jiaming Shen, Zhen Qin, Tianhe Yu, Daniel Sohn, Anastasiia Makarova, Jeremiah Liu, Yuan Liu, Bilal Piot, Abe Ittycheriah, Aviral Kumar, Mohammad Saleh
cs.AI
Abstract
Reward models (RMs) play a pivotal role in aligning large language models
(LLMs) with human preferences. However, traditional RM training, which relies
on response pairs tied to specific prompts, struggles to disentangle
prompt-driven preferences from prompt-independent artifacts, such as response
length and format. In this work, we expose a fundamental limitation of current
RM training methods, where RMs fail to effectively distinguish between
contextual signals and irrelevant artifacts when determining preferences. To
address this, we introduce a causal framework that learns preferences
independent of these artifacts and propose a novel data augmentation technique
designed to eliminate them. Extensive experiments show that our approach
successfully filters out undesirable artifacts, yielding a more robust reward
model (RRM). Our RRM improves the RewardBench performance of a pairwise reward
model trained on Gemma-2-9b-it, increasing accuracy from 80.61% to 84.15%.
Additionally, we train two DPO policies using both the RM and RRM,
demonstrating that the RRM significantly enhances DPO-aligned policies,
improving MT-Bench scores from 7.27 to 8.31 and length-controlled win-rates in
AlpacaEval-2 from 33.46% to 52.49%.
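
For context, reward models of this kind are typically trained with a pairwise Bradley-Terry objective over (prompt, chosen, rejected) triples. The snippet below is a minimal, self-contained sketch of that standard objective, not the paper's implementation: the toy encoder, placeholder features, and `ScalarRewardModel` name are assumptions standing in for an LLM backbone such as Gemma-2-9b-it with a value head, and the paper's pairwise RM may instead score both responses jointly.

```python
# Minimal sketch (not the paper's exact implementation): the standard
# Bradley-Terry pairwise objective commonly used to train reward models
# from (prompt, chosen, rejected) triples.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScalarRewardModel(nn.Module):
    """Toy reward model: encodes precomputed features and outputs a scalar reward."""

    def __init__(self, hidden_size: int = 128):
        super().__init__()
        # Stand-in encoder; in practice this would be an LLM backbone
        # (e.g. Gemma-2-9b-it) with a scalar value head.
        self.encoder = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.Tanh())
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.value_head(self.encoder(features)).squeeze(-1)


def pairwise_rm_loss(model: ScalarRewardModel,
                     chosen_feats: torch.Tensor,
                     rejected_feats: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    r_chosen = model(chosen_feats)
    r_rejected = model(rejected_feats)
    return -F.logsigmoid(r_chosen - r_rejected).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    rm = ScalarRewardModel()
    chosen = torch.randn(4, 128)    # placeholder features for (prompt, chosen)
    rejected = torch.randn(4, 128)  # placeholder features for (prompt, rejected)
    print(pairwise_rm_loss(rm, chosen, rejected).item())
```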
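The abstract describes a data augmentation technique that removes prompt-independent artifacts but does not spell out its construction. The sketch below shows one natural cross-prompt instantiation of that idea: responses drawn from unrelated prompts should lose regardless of their length or formatting, so surface artifacts alone cannot predict preference. The concrete augmentation rules in the paper may differ, and every name in the snippet is hypothetical.

```python
# Illustrative sketch only: artifact-balancing augmented comparisons built
# by pairing responses across unrelated prompts, in the spirit of the
# abstract's "data augmentation technique designed to eliminate
# prompt-independent artifacts". Not the paper's exact scheme.
import random
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def augment_with_cross_prompt_pairs(data: list[PreferencePair],
                                     seed: int = 0) -> list[PreferencePair]:
    """For each original pair, add comparisons against a response drawn from a
    different prompt: an on-topic response (even the originally rejected one)
    should beat an off-topic one, no matter its length or format."""
    if len(data) < 2:
        return list(data)
    rng = random.Random(seed)
    augmented = list(data)
    for i, ex in enumerate(data):
        # Sample a donor example from a different prompt.
        j = rng.randrange(len(data) - 1)
        if j >= i:
            j += 1
        donor = data[j]
        # Off-topic responses lose regardless of surface artifacts.
        augmented.append(PreferencePair(ex.prompt, ex.chosen, donor.chosen))
        augmented.append(PreferencePair(ex.prompt, ex.rejected, donor.chosen))
    return augmented


if __name__ == "__main__":
    toy = [
        PreferencePair("What is 2+2?", "4.", "5."),
        PreferencePair("Name a prime.", "7 is prime.", "8 is prime."),
        PreferencePair("Capital of France?", "Paris.", "Lyon."),
    ]
    for pair in augment_with_cross_prompt_pairs(toy):
        print(pair)
```

Pairs produced this way can be fed to the same pairwise objective sketched above, so that the reward model sees comparisons in which length and format are uninformative about the label.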