IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation

Abstract

Instruction following is a foundational capability of large language models (LLMs), and improving it depends on scalable, accurate feedback from judge models. However, the reliability of current judge models on instruction following remains underexplored because existing meta-evaluation benchmarks have several deficiencies, including insufficient data coverage and oversimplified pairwise evaluation paradigms that are misaligned with model optimization scenarios. To address this, we propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction following that covers diverse instruction and constraint types. For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses, based on instruction-following quality. This design enables a listwise evaluation paradigm that assesses a judge model's ability to rank multiple responses, which is essential for guiding model alignment. Extensive experiments on IF-RewardBench reveal significant deficiencies in current judge models and show that our benchmark correlates more strongly and positively with downstream task performance than existing benchmarks do. Our code and data are available at https://github.com/thu-coai/IF-RewardBench.
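
To make the preference-graph idea concrete, the following is a minimal sketch and not the benchmark's actual implementation or metric. It assumes each response carries a hypothetical integer quality score (for example, a count of satisfied constraints), derives every pairwise preference from those scores, and then measures how many of those preferences a judge's ranking reproduces. The function names `build_preference_graph` and `ranking_agreement`, the response ids, and the scores are all illustrative assumptions.

```python
from itertools import combinations


def build_preference_graph(quality):
    """Return the set of (better, worse) edges implied by per-response scores.

    `quality` maps a response id to a hypothetical quality score.
    Tied responses produce no edge between them.
    """
    edges = set()
    for a, b in combinations(quality, 2):
        if quality[a] > quality[b]:
            edges.add((a, b))
        elif quality[b] > quality[a]:
            edges.add((b, a))
    return edges


def ranking_agreement(ranking, edges):
    """Fraction of preference edges that a ranking (best first) orders correctly."""
    if not edges:
        return 1.0  # nothing to disagree with
    position = {resp: i for i, resp in enumerate(ranking)}
    correct = sum(position[better] < position[worse] for better, worse in edges)
    return correct / len(edges)


# Hypothetical scores for four responses to one instruction.
quality = {"r1": 3, "r2": 1, "r3": 2, "r4": 1}
graph = build_preference_graph(quality)

# A hypothetical judge ranking, best response first.
judge_ranking = ["r1", "r2", "r3", "r4"]
score = ranking_agreement(judge_ranking, graph)
print(f"{len(graph)} preference edges, agreement = {score:.2f}")
```

Scoring a full ranking against all edges at once, rather than one isolated pair, is what distinguishes this listwise view from a purely pairwise check; the benchmark itself may use a different agreement measure.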
