

RewardBench: Evaluating Reward Models for Language Modeling

March 20, 2024
Authors: Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi
cs.AI

Abstract

Reward models (RMs) are at the crux of successful RLHF to align pretrained models to human preferences, yet there has been relatively little study that focuses on evaluation of those reward models. Evaluating reward models presents an opportunity to understand the opaque technologies used for alignment of language models and which values are embedded in them. To date, very few descriptors of capabilities, training methods, or open-source reward models exist. In this paper, we present RewardBench, a benchmark dataset and code-base for evaluation, to enhance scientific understanding of reward models. The RewardBench dataset is a collection of prompt-win-lose trios spanning chat, reasoning, and safety, to benchmark how reward models perform on challenging, structured and out-of-distribution queries. We created specific comparison datasets for RMs that have subtle, but verifiable reasons (e.g. bugs, incorrect facts) why one answer should be preferred to another. On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods, such as the direct MLE training of classifiers and the implicit reward modeling of Direct Preference Optimization (DPO), and on a spectrum of datasets. We present many findings on propensity for refusals, reasoning limitations, and instruction following shortcomings of various reward models towards a better understanding of the RLHF process.
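The evaluation the abstract describes, checking whether a reward model assigns a higher scalar score to the "win" response than to the "lose" response in each prompt-win-lose trio, can be made concrete with a short sketch. The snippet below is an illustrative outline, not the RewardBench codebase itself; the model checkpoint and the prompt/chosen/rejected field names are assumptions for the example.

```python
# Minimal sketch: score a classifier-style reward model on prompt-win-lose trios.
# A trio counts as correct when the chosen (win) response outscores the rejected (lose) one.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Example open classifier-style RM; any sequence-classification reward model would do.
MODEL_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def reward(prompt: str, response: str) -> float:
    """Return the scalar reward the model assigns to (prompt, response)."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

def trio_accuracy(trios: list) -> float:
    """Fraction of trios where the chosen answer receives a higher reward than the rejected one."""
    wins = sum(
        reward(t["prompt"], t["chosen"]) > reward(t["prompt"], t["rejected"])
        for t in trios
    )
    return wins / len(trios)

# Toy trio with a subtle but verifiable error in the rejected answer (hypothetical data).
trios = [
    {
        "prompt": "What is 2 + 2?",
        "chosen": "2 + 2 equals 4.",
        "rejected": "2 + 2 equals 5.",
    },
]
print(f"accuracy: {trio_accuracy(trios):.2f}")
```

For DPO-trained models, which have no classifier head, the same comparison can instead use the implicit reward beta * log(pi_theta(y|x) / pi_ref(y|x)); since beta is positive and the prompt is shared within a trio, ranking the chosen above the rejected response reduces to comparing the two log-probability ratios.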
