reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs

Abstract

Reward models have become a staple in modern NLP, serving not only as scalable text evaluators but also as indispensable components in many alignment recipes and inference-time algorithms. However, while recent reward models achieve higher performance on standard benchmarks, this may be partly due to overfitting, which would confound our understanding of their true capability. In this work, we scrutinize the robustness of reward models and the extent of such overfitting. We build **reWordBench**, which systematically transforms reward model inputs in meaning- or ranking-preserving ways. We show that state-of-the-art reward models suffer substantial performance degradation even under minor input transformations, sometimes dropping to significantly below-random accuracy, which suggests brittleness. To improve reward model robustness, we propose to explicitly train them to assign similar scores to paraphrases, and find that this approach also improves robustness to other distinct kinds of transformations. For example, our robust reward model reduces such degradation by roughly half for the Chat Hard subset in RewardBench. Furthermore, when used in alignment, our robust reward models demonstrate better utility and lead to higher-quality outputs, winning in up to 59% of instances against a standardly trained RM.
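To make the idea of training a reward model to assign similar scores to paraphrases more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the loss, so the Bradley-Terry style preference term, the squared-difference paraphrase penalty, the `weight` parameter, and all function names are hypothetical choices made for this example. The `ranking_accuracy` helper shows one way the before-and-after-transformation accuracy comparison could be computed from reward scores.

```python
import math


def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry style loss, -log(sigmoid(chosen - rejected)).

    Assumed here as a common reward-model objective; the abstract
    does not state which objective the authors use.
    """
    margin = score_chosen - score_rejected
    # Numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def paraphrase_penalty(score_original: float, score_paraphrase: float) -> float:
    """Hypothetical consistency term: penalize score gaps between paraphrases."""
    return (score_original - score_paraphrase) ** 2


def robust_loss(
    chosen: float,
    rejected: float,
    chosen_paraphrase: float,
    rejected_paraphrase: float,
    weight: float = 1.0,  # assumed trade-off hyperparameter
) -> float:
    """Preference loss plus a weighted paraphrase-consistency penalty."""
    consistency = paraphrase_penalty(chosen, chosen_paraphrase) + paraphrase_penalty(
        rejected, rejected_paraphrase
    )
    return preference_loss(chosen, rejected) + weight * consistency


def ranking_accuracy(score_pairs: list[tuple[float, float]]) -> float:
    """Fraction of (chosen_score, rejected_score) pairs ranked correctly."""
    correct = sum(1 for chosen, rejected in score_pairs if chosen > rejected)
    return correct / len(score_pairs)
```

In this sketch, scoring the same preference pairs once on the original inputs and once on their transformed versions, then comparing the two `ranking_accuracy` values, would give the kind of degradation measure the abstract refers to.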
