VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity

Abstract

Visual reasoning is central to human cognition, enabling individuals to interpret and abstractly understand their environment. Although recent Multimodal Large Language Models (MLLMs) have demonstrated impressive performance across language and vision-language tasks, existing benchmarks primarily measure recognition-based skills and do not adequately assess true visual reasoning capabilities. To bridge this gap, we introduce VERIFY, a benchmark explicitly designed to isolate and rigorously evaluate the visual reasoning capabilities of state-of-the-art MLLMs. VERIFY compels models to reason primarily from visual information, providing minimal textual context to reduce reliance on domain-specific knowledge and linguistic biases. Each problem is accompanied by a human-annotated reasoning path, making VERIFY the first benchmark to provide in-depth evaluation of model decision-making processes. Additionally, we propose novel metrics that assess visual reasoning fidelity beyond mere accuracy, highlighting critical imbalances in current model reasoning patterns. Our comprehensive benchmarking of leading MLLMs uncovers significant limitations, underscoring the need for a balanced and holistic approach to both perception and reasoning. For teasers and testing, visit our project page (https://verify-eqh.pages.dev/).