

RewardBench: Evaluating Reward Models for Language Modeling

March 20, 2024
Authors: Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi
cs.AI

Abstract

Reward models (RMs) are at the crux of successful RLHF to align pretrained models to human preferences, yet there has been relatively little study that focuses on evaluation of those reward models. Evaluating reward models presents an opportunity to understand the opaque technologies used for alignment of language models and which values are embedded in them. To date, very few descriptors of capabilities, training methods, or open-source reward models exist. In this paper, we present RewardBench, a benchmark dataset and code-base for evaluation, to enhance scientific understanding of reward models. The RewardBench dataset is a collection of prompt-win-lose trios spanning chat, reasoning, and safety, to benchmark how reward models perform on challenging, structured and out-of-distribution queries. We created specific comparison datasets for RMs that have subtle, but verifiable reasons (e.g. bugs, incorrect facts) why one answer should be preferred to another. On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods, such as the direct MLE training of classifiers and the implicit reward modeling of Direct Preference Optimization (DPO), and on a spectrum of datasets. We present many findings on propensity for refusals, reasoning limitations, and instruction following shortcomings of various reward models towards a better understanding of the RLHF process.
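The abstract describes the core evaluation protocol: each RewardBench instance is a prompt paired with a preferred ("win") and a dispreferred ("lose") completion, and a reward model is judged by how often it assigns the higher score to the winning completion. The sketch below illustrates that idea for a classifier-style RM scored with Hugging Face Transformers; the model name, dataset keys, and helper functions are illustrative assumptions, not the official RewardBench codebase.

```python
# Minimal sketch of a RewardBench-style accuracy check for a classifier RM.
# The model name and dict keys are illustrative assumptions, not the official
# RewardBench code or dataset schema.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"  # example sequence-classifier RM

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def reward(prompt: str, completion: str) -> float:
    """Score one prompt-completion pair; higher means more preferred."""
    inputs = tokenizer(prompt, completion, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

def rewardbench_accuracy(trios) -> float:
    """trios: sequence of {'prompt', 'chosen', 'rejected'} dicts.
    Accuracy = fraction of trios where the chosen answer gets the higher reward."""
    correct = sum(
        reward(t["prompt"], t["chosen"]) > reward(t["prompt"], t["rejected"])
        for t in trios
    )
    return correct / len(trios)

# Toy usage with a single hypothetical trio:
if __name__ == "__main__":
    trios = [{
        "prompt": "What is 2 + 2?",
        "chosen": "2 + 2 equals 4.",
        "rejected": "2 + 2 equals 5.",
    }]
    print(f"accuracy: {rewardbench_accuracy(trios):.3f}")
```

For DPO-trained models, which have no explicit reward head, the same comparison can instead be run on the implicit reward β(log π_θ(y|x) − log π_ref(y|x)) computed from the policy and reference-model log-probabilities of each completion.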

