RewardBench: Evaluating Reward Models for Language Modeling

Abstract

Reward models (RMs) are at the crux of successful RLHF (reinforcement learning from human feedback) to align pretrained models to human preferences, yet relatively little work has focused on evaluating those reward models. Evaluating reward models presents an opportunity to understand the opaque technologies used for alignment of language models and which values are embedded in them. To date, very few descriptors of capabilities, training methods, or open-source reward models exist. In this paper, we present RewardBench, a benchmark dataset and codebase for evaluation, to enhance scientific understanding of reward models. The RewardBench dataset is a collection of prompt-win-lose trios spanning chat, reasoning, and safety, used to benchmark how reward models perform on challenging, structured, and out-of-distribution queries. We created specific comparison datasets for RMs in which one answer should be preferred to another for subtle but verifiable reasons (e.g., bugs, incorrect facts). On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods, such as the direct MLE (maximum likelihood estimation) training of classifiers and the implicit reward modeling of Direct Preference Optimization (DPO), and on a spectrum of datasets. We present many findings on the propensity for refusals, reasoning limitations, and instruction-following shortcomings of various reward models, working towards a better understanding of the RLHF process.
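As a rough illustration of what evaluating on prompt-win-lose trios can look like, the sketch below counts a trio as correct when a scoring function rates the winning response above the losing one. This is a minimal sketch under stated assumptions, not the RewardBench codebase: the names `Trio`, `pairwise_accuracy`, `toy_score`, and `dpo_implicit_reward` are hypothetical, the accuracy metric is assumed rather than taken from the abstract, and the implicit reward uses the standard DPO form (beta times the policy-to-reference log-probability ratio, up to a prompt-dependent term that cancels when comparing two responses to the same prompt).

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Trio:
    """One prompt with a preferred (chosen) and a dispreferred (rejected) response."""
    prompt: str
    chosen: str
    rejected: str


def pairwise_accuracy(trios: Iterable[Trio], score: Callable[[str, str], float]) -> float:
    """Fraction of trios where score(prompt, chosen) > score(prompt, rejected)."""
    items = list(trios)
    if not items:
        raise ValueError("need at least one trio")
    wins = sum(score(t.prompt, t.chosen) > score(t.prompt, t.rejected) for t in items)
    return wins / len(items)


def dpo_implicit_reward(policy_logprob: float, ref_logprob: float, beta: float = 0.1) -> float:
    """Implicit DPO reward: beta * (log pi(y|x) - log pi_ref(y|x)).

    The inputs are summed token log-probabilities of the response under the
    trained policy and the reference model (assumed to be computed elsewhere).
    """
    return beta * (policy_logprob - ref_logprob)


def toy_score(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a reward model: simply prefers shorter responses."""
    return -float(len(response))


# Made-up example data, for illustration only.
example_trios = [
    Trio("What is 2 + 2?", "4", "It is 5."),
    Trio("What is the capital of France?", "Paris is the capital.", "Lyon"),
]
accuracy = pairwise_accuracy(example_trios, toy_score)
```

A classifier-style reward model would be plugged in as `score` directly, while a DPO-trained model would be wrapped so that `score` returns `dpo_implicit_reward` computed from the two models' log-probabilities. The length-based `toy_score` is deliberately naive; it only shows how a scorer with a superficial bias gets some trios right and others wrong.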
