RewardBench: Evaluating Reward Models for Language Modeling

Abstract

Reward models (RMs) are at the crux of successful RLHF (Reinforcement Learning from Human Feedback) to align pretrained models to human preferences, yet relatively little work has focused on evaluating those reward models. Evaluating reward models presents an opportunity to understand the opaque technologies used to align language models and the values embedded in them. To date, very few descriptors of capabilities, training methods, or open-source reward models exist. In this paper, we present RewardBench, a benchmark dataset and codebase for evaluation, to enhance scientific understanding of reward models. The RewardBench dataset is a collection of prompt-win-lose trios spanning chat, reasoning, and safety, used to benchmark how reward models perform on challenging, structured, and out-of-distribution queries. We created specific comparison datasets for RMs in which one answer should be preferred to another for subtle but verifiable reasons (e.g., bugs, incorrect facts). On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods, such as the direct MLE training of classifiers and the implicit reward modeling of Direct Preference Optimization (DPO), and on a spectrum of datasets. We present many findings on the propensity for refusals, reasoning limitations, and instruction-following shortcomings of various reward models, working towards a better understanding of the RLHF process.
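To make the prompt-win-lose format concrete, here is a minimal sketch of how a reward model could be scored on such trios. It assumes, since the abstract does not spell out the metric, that a trio counts as correct when the model assigns a strictly higher score to the winning response than to the losing one. The `Trio` fields, the `pairwise_accuracy` helper, and the `toy_score` function are hypothetical names introduced for illustration and are not taken from the RewardBench codebase.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple


class Trio(NamedTuple):
    """One prompt-win-lose example (field names are illustrative assumptions)."""
    prompt: str
    chosen: str    # the response that should be preferred
    rejected: str  # the response with a verifiable flaw
    subset: str    # e.g. "chat", "reasoning", "safety"


def pairwise_accuracy(
    trios: List[Trio],
    score: Callable[[str, str], float],
) -> Dict[str, float]:
    """Per-subset fraction of trios where the chosen response scores strictly higher."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t in trios:
        total[t.subset] += 1
        if score(t.prompt, t.chosen) > score(t.prompt, t.rejected):
            correct[t.subset] += 1
    return {name: correct[name] / total[name] for name in total}


def toy_score(prompt: str, response: str) -> float:
    """Stand-in reward that favors longer responses (hypothetical, not a real RM)."""
    return float(len(response))


examples = [
    Trio("What is 2 + 2?", "2 + 2 equals 4.", "5", "reasoning"),
    Trio("Say hi.", "Hi!", "I cannot help with that request.", "chat"),
]
results = pairwise_accuracy(examples, toy_score)
```

In practice `score` would wrap a trained classifier-style reward model or, for a DPO-trained model, an implicit reward derived from policy and reference log-probabilities. The length-based stand-in above only demonstrates the bookkeeping, and it also shows how a length bias can mislead a scorer on the second example.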
