
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style

October 21, 2024
Authors: Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, Juanzi Li
cs.AI

Abstract

Reward models are critical in techniques like Reinforcement Learning from Human Feedback (RLHF) and Inference Scaling Laws, where they guide language model alignment and select optimal responses. Despite their importance, existing reward model benchmarks often evaluate models by asking them to distinguish between responses generated by models of varying power. However, this approach fails to assess reward models on subtle but critical content changes and variations in style, resulting in a low correlation with policy model performance. To this end, we introduce RM-Bench, a novel benchmark designed to evaluate reward models based on their sensitivity to subtle content differences and resistance to style biases. Extensive experiments demonstrate that RM-Bench strongly correlates with policy model performance, making it a reliable reference for selecting reward models to align language models effectively. We evaluate nearly 40 reward models on RM-Bench. Our results reveal that, when faced with style bias interference, even state-of-the-art models achieve an average performance of only 46.6%, falling short of random-level accuracy (50%). These findings highlight the significant room for improvement in current reward models. Related code and data are available at https://github.com/THU-KEG/RM-Bench.
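
For intuition, the sketch below shows the kind of pairwise accuracy computation that underlies benchmarks of this sort: a reward model is credited for a pair when it assigns the chosen response a higher score than the rejected one. The sample format and the reward_fn signature are assumptions made for illustration only; the official RM-Bench data format and evaluation code are in the repository linked above.

# Minimal sketch (not the official evaluation script) of pairwise reward-model
# accuracy: the fraction of (chosen, rejected) response pairs for which the
# reward model scores the chosen response higher.
from typing import Callable, Dict, List

def pairwise_accuracy(
    samples: List[Dict[str, str]],           # each: {"prompt", "chosen", "rejected"} (assumed format)
    reward_fn: Callable[[str, str], float],  # hypothetical scorer: (prompt, response) -> scalar reward
) -> float:
    """Return the fraction of pairs where the chosen response gets the higher reward."""
    if not samples:
        return 0.0
    correct = 0
    for s in samples:
        r_chosen = reward_fn(s["prompt"], s["chosen"])
        r_rejected = reward_fn(s["prompt"], s["rejected"])
        correct += int(r_chosen > r_rejected)
    return correct / len(samples)

# Toy usage: a length-based scorer (a known style bias) prefers the verbose but
# subtly wrong answer, so it fails this pair and scores 0.0.
toy_samples = [
    {
        "prompt": "Q",
        "chosen": "Correct but terse.",
        "rejected": "A much longer, well-formatted, yet subtly incorrect answer.",
    },
]
print(pairwise_accuracy(toy_samples, lambda prompt, response: float(len(response))))

A scorer that rewards length or heavy formatting illustrates the style-bias failure mode the paper measures: such a model can look strong on easy comparisons yet fall below 50% accuracy once the rejected response is merely better styled rather than better grounded.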
