

reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs

March 14, 2025
Authors: Zhaofeng Wu, Michihiro Yasunaga, Andrew Cohen, Yoon Kim, Asli Celikyilmaz, Marjan Ghazvininejad
cs.AI

Abstract

Reward models have become a staple in modern NLP, serving not only as scalable text evaluators but also as indispensable components in many alignment recipes and inference-time algorithms. However, while recent reward models increase performance on standard benchmarks, this may partly be due to overfitting effects, which would confound an understanding of their true capability. In this work, we scrutinize the robustness of reward models and the extent of such overfitting. We build **reWordBench**, which systematically transforms reward model inputs in meaning- or ranking-preserving ways. We show that state-of-the-art reward models suffer from substantial performance degradation even with minor input transformations, sometimes dropping to significantly below-random accuracy, suggesting brittleness. To improve reward model robustness, we propose to explicitly train them to assign similar scores to paraphrases, and find that this approach also improves robustness to other distinct kinds of transformations. For example, our robust reward model reduces such degradation by roughly half for the Chat Hard subset in RewardBench. Furthermore, when used in alignment, our robust reward models demonstrate better utility and lead to higher-quality outputs, winning in up to 59% of instances against a standardly trained reward model.
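The paraphrase-consistency idea described in the abstract can be made concrete with a small training-loss sketch. The snippet below is a minimal illustration, not the authors' implementation: it assumes a scalar-output reward model, combines the standard Bradley-Terry pairwise preference loss with a hypothetical consistency penalty on paraphrase score differences, and the function name `robust_reward_loss`, the argument names, and the weight `lam` are all illustrative choices.

```python
import torch
import torch.nn.functional as F

def robust_reward_loss(reward_model, chosen, rejected, chosen_para, lam=0.1):
    """Pairwise preference loss plus a paraphrase-consistency penalty.

    A minimal sketch: `reward_model` is assumed to map a batch of encoded
    inputs to scalar scores, and `chosen_para` holds paraphrases of `chosen`.
    The consistency term and its weight `lam` are illustrative assumptions.
    """
    r_chosen = reward_model(chosen)      # scores for preferred responses
    r_rejected = reward_model(rejected)  # scores for dispreferred responses
    r_para = reward_model(chosen_para)   # scores for paraphrased preferred responses

    # Standard Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected)
    pref_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

    # Consistency penalty: a paraphrase should receive a similar score
    consistency = (r_chosen - r_para).pow(2).mean()

    return pref_loss + lam * consistency

if __name__ == "__main__":
    # Toy check with a stand-in "reward model" that just sums features
    dummy = lambda x: x.sum(dim=-1)
    a, b, c = torch.randn(4, 8), torch.randn(4, 8), torch.randn(4, 8)
    print(robust_reward_loss(dummy, a, b, c))
```

The consistency term pulls the scores of an input and its paraphrase together during training, which is one plausible way to realize "assigning similar scores to paraphrases"; the paper reports that this also transfers to other, distinct input transformations.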

