
reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs

March 14, 2025
Authors: Zhaofeng Wu, Michihiro Yasunaga, Andrew Cohen, Yoon Kim, Asli Celikyilmaz, Marjan Ghazvininejad
cs.AI

Abstract

Reward models have become a staple in modern NLP, serving not only as scalable text evaluators but also as an indispensable component in many alignment recipes and inference-time algorithms. However, while recent reward models improve performance on standard benchmarks, this may be partly due to overfitting effects, which would confound an understanding of their true capability. In this work, we scrutinize the robustness of reward models and the extent of such overfitting. We build **reWordBench**, which systematically transforms reward model inputs in meaning- or ranking-preserving ways. We show that state-of-the-art reward models suffer substantial performance degradation even under minor input transformations, sometimes dropping to significantly below-random accuracy, suggesting brittleness. To improve reward model robustness, we propose to explicitly train them to assign similar scores to paraphrases, and find that this approach also improves robustness to other, distinct kinds of transformations. For example, our robust reward model reduces such degradation by roughly half on the Chat Hard subset of RewardBench. Furthermore, when used in alignment, our robust reward models demonstrate better utility and lead to higher-quality outputs, winning in up to 59% of instances against a standardly trained RM.
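
The regularization idea in the abstract, training the reward model to assign similar scores to paraphrases, can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch example, not the authors' released code: the function name `robust_reward_loss`, the batch fields, and the loss weighting are all assumptions. It combines a standard Bradley-Terry preference loss with a consistency term that penalizes score differences between a response and its paraphrase.

```python
# Minimal sketch of paraphrase-consistency training for a reward model.
# Hypothetical: the model interface, batch fields, and loss weighting are
# assumptions for illustration, not the paper's implementation.
import torch
import torch.nn.functional as F

def robust_reward_loss(reward_model, batch, consistency_weight=1.0):
    # Standard Bradley-Terry preference loss: the chosen response
    # should score higher than the rejected one.
    r_chosen = reward_model(batch["chosen"])      # shape: (batch_size,)
    r_rejected = reward_model(batch["rejected"])  # shape: (batch_size,)
    preference_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

    # Consistency regularizer: a paraphrase of the chosen response
    # should receive (nearly) the same score.
    r_paraphrase = reward_model(batch["chosen_paraphrase"])
    consistency_loss = (r_chosen - r_paraphrase).pow(2).mean()

    return preference_loss + consistency_weight * consistency_loss
```

In the same spirit, scoring a suite of meaning- or ranking-preserving transformations of each input and checking whether the model's rankings survive would give a reWordBench-style robustness probe of a trained model.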
