VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity

Abstract

Visual reasoning is central to human cognition, enabling individuals to interpret their environment and understand it abstractly. Although recent Multimodal Large Language Models (MLLMs) have demonstrated impressive performance across language and vision-language tasks, existing benchmarks primarily measure recognition-based skills and do not adequately assess true visual reasoning capabilities. To bridge this gap, we introduce VERIFY, a benchmark explicitly designed to isolate and rigorously evaluate the visual reasoning capabilities of state-of-the-art MLLMs. VERIFY compels models to reason primarily from visual information, providing minimal textual context to reduce reliance on domain-specific knowledge and linguistic biases. Each problem is accompanied by a human-annotated reasoning path, making VERIFY the first benchmark to provide in-depth evaluation of model decision-making processes. Additionally, we propose novel metrics that assess visual reasoning fidelity beyond mere accuracy, highlighting critical imbalances in current model reasoning patterns. Our comprehensive benchmarking of leading MLLMs uncovers significant limitations, underscoring the need for a balanced and holistic approach to both perception and reasoning. For more teaser examples and testing, visit the project page (https://verify-eqh.pages.dev/).
