VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

Abstract

Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems, yet their own evaluation remains under-explored. Current assessment methods rely primarily on AI-annotated preference labels from traditional VL tasks, which can introduce biases and often fail to challenge state-of-the-art models. To address these limitations, we introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks. Through an AI-assisted annotation pipeline that combines sample selection with human verification, we curate 1,250 high-quality examples specifically designed to probe model limitations. A comprehensive evaluation of 16 leading large vision-language models demonstrates VL-RewardBench's effectiveness as a challenging testbed: even GPT-4o achieves only 65.4% accuracy, and state-of-the-art open-source models such as Qwen2-VL-72B struggle to surpass random guessing. Importantly, performance on VL-RewardBench correlates strongly (Pearson's r > 0.9) with MMMU-Pro accuracy obtained using Best-of-N sampling with VL-GenRMs. Analysis experiments uncover three critical insights for improving VL-GenRMs: (i) models predominantly fail at basic visual perception tasks rather than reasoning tasks; (ii) inference-time scaling benefits vary dramatically by model capacity; and (iii) training VL-GenRMs to learn to judge substantially boosts judgment capability (+14.7% accuracy for a 7B VL-GenRM). We believe VL-RewardBench, along with the experimental insights, will become a valuable resource for advancing VL-GenRMs.
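The three quantities the abstract reports (judge accuracy, Best-of-N selection, and Pearson correlation) can be illustrated with a minimal sketch. It assumes each benchmark example is a pair of candidate responses with a human-verified preferred index, so a random judge would be expected to score about 50%. The function names and data layout are hypothetical illustrations and are not taken from the authors' code.

```python
from math import sqrt
from typing import Callable, Sequence


def pairwise_accuracy(judge_picks: Sequence[int], human_picks: Sequence[int]) -> float:
    """Fraction of preference pairs where the judge picks the human-preferred response.

    Assumption: each entry is 0 or 1, the index of the preferred response in a pair.
    """
    if len(judge_picks) != len(human_picks) or not human_picks:
        raise ValueError("inputs must be non-empty and of equal length")
    agree = sum(j == h for j, h in zip(judge_picks, human_picks))
    return agree / len(human_picks)


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Best-of-N selection: return the candidate with the highest reward score.

    `score` stands in for a reward model; ties resolve to the earliest candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

In this reading, `pearson_r` would be applied across models, pairing each model's benchmark accuracy with the downstream accuracy obtained when that model is used as the scorer in `best_of_n`.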
