

M-RewardBench: Evaluating Reward Models in Multilingual Settings

October 20, 2024
作者: Srishti Gureja, Lester James V. Miranda, Shayekh Bin Islam, Rishabh Maheshwary, Drishti Sharma, Gusti Winata, Nathan Lambert, Sebastian Ruder, Sara Hooker, Marzieh Fadaee
cs.AI

Abstract

Reward models (RMs) have driven the state-of-the-art performance of LLMs today by enabling the integration of human feedback into the language modeling process. However, RMs are primarily trained and evaluated in English, and their capabilities in multilingual settings remain largely understudied. In this work, we conduct a systematic evaluation of several reward models in multilingual settings. We first construct the first-of-its-kind multilingual RM evaluation benchmark, M-RewardBench, consisting of 2.87k preference instances for 23 typologically diverse languages, which tests the chat, safety, reasoning, and translation capabilities of RMs. We then rigorously evaluate a wide range of reward models on M-RewardBench, offering fresh insights into their performance across diverse languages. We identify a significant gap in RM performance between English and non-English languages and show that RM preferences can change substantially from one language to another. We also present several findings on how different multilingual aspects impact RM performance. Specifically, we show that the performance of RMs improves with improved translation quality. Similarly, we demonstrate that the models exhibit better performance for high-resource languages. We release the M-RewardBench dataset and codebase from this study to facilitate a better understanding of RM evaluation in multilingual settings.
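As an illustration of the evaluation protocol the abstract describes, the sketch below shows how per-language preference accuracy could be computed over chosen/rejected pairs: the reward model counts as correct when it scores the chosen response above the rejected one. The field names (lang, prompt, chosen, rejected) and the dummy score function are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of preference-accuracy evaluation on an M-RewardBench-style
# dataset. The data layout and scoring function are illustrative assumptions.
from collections import defaultdict

def score(prompt: str, response: str) -> float:
    """Placeholder for a reward model's scalar score; replace with a real RM."""
    return float(len(response))  # dummy heuristic, for illustration only

def preference_accuracy(instances):
    """Each instance: dict with 'lang', 'prompt', 'chosen', 'rejected'.
    The RM is counted correct when the chosen response outscores the rejected one."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in instances:
        lang = ex["lang"]
        total[lang] += 1
        if score(ex["prompt"], ex["chosen"]) > score(ex["prompt"], ex["rejected"]):
            correct[lang] += 1
    return {lang: correct[lang] / total[lang] for lang in total}

# Example usage with two toy instances (not real benchmark data):
toy = [
    {"lang": "de", "prompt": "Hallo?", "chosen": "Hallo! Wie kann ich helfen?", "rejected": "..."},
    {"lang": "en", "prompt": "Hi?", "chosen": "Hello! How can I help?", "rejected": "..."},
]
print(preference_accuracy(toy))
```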
