
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style

October 21, 2024
Authors: Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, Juanzi Li
cs.AI

Abstract

Reward models are critical in techniques like Reinforcement Learning from Human Feedback (RLHF) and Inference Scaling Laws, where they guide language model alignment and select optimal responses. Despite their importance, existing reward model benchmarks often evaluate models by asking them to distinguish between responses generated by models of varying power. However, this approach fails to assess reward models on subtle but critical content changes and variations in style, resulting in a low correlation with policy model performance. To this end, we introduce RM-Bench, a novel benchmark designed to evaluate reward models based on their sensitivity to subtle content differences and resistance to style biases. Extensive experiments demonstrate that RM-Bench strongly correlates with policy model performance, making it a reliable reference for selecting reward models to align language models effectively. We evaluate nearly 40 reward models on RM-Bench. Our results reveal that even state-of-the-art models achieve an average performance of only 46.6%, which falls short of random-level accuracy (50%) when faced with style bias interference. These findings highlight the significant room for improvement in current reward models. Related code and data are available at https://github.com/THU-KEG/RM-Bench.
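To make the evaluation idea concrete, the sketch below computes a pairwise accuracy in the spirit of RM-Bench: a reward model passes a comparison only when it scores the correct response above a subtly flawed one, across different style renderings of both. The `StyledPair` container and the `score(prompt, response)` interface are hypothetical stand-ins for illustration; the actual data format and evaluation protocol are defined in the linked repository.

```python
# Minimal sketch of an RM-Bench-style pairwise accuracy check.
# Assumes a hypothetical `score(prompt, response) -> float` reward-model
# interface; the real data format and protocol live in the RM-Bench repo.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StyledPair:
    prompt: str
    chosen_variants: List[str]    # correct response rendered in several styles
    rejected_variants: List[str]  # subtly flawed response in the same styles


def pairwise_accuracy(score: Callable[[str, str], float],
                      pairs: List[StyledPair]) -> float:
    """Fraction of (chosen, rejected) style combinations in which the reward
    model prefers the correct response, regardless of its style."""
    correct = total = 0
    for pair in pairs:
        for chosen in pair.chosen_variants:
            for rejected in pair.rejected_variants:
                total += 1
                if score(pair.prompt, chosen) > score(pair.prompt, rejected):
                    correct += 1
    return correct / total if total else 0.0
```

Under this framing, random scoring yields roughly 50% accuracy, which is the baseline the paper's reported 46.6% average falls below under style-bias interference.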
