VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity

Abstract

Visual reasoning is central to human cognition, enabling individuals to interpret and abstractly understand their environment. Although recent Multimodal Large Language Models (MLLMs) have demonstrated impressive performance across language and vision-language tasks, existing benchmarks primarily measure recognition-based skills and inadequately assess true visual reasoning capabilities. To bridge this critical gap, we introduce VERIFY, a benchmark explicitly designed to isolate and rigorously evaluate the visual reasoning capabilities of state-of-the-art MLLMs. VERIFY compels models to reason primarily from visual information, providing minimal textual context to reduce reliance on domain-specific knowledge and linguistic biases. Each problem is accompanied by a human-annotated reasoning path, making VERIFY the first benchmark to provide in-depth evaluation of model decision-making processes. Additionally, we propose novel metrics that assess visual reasoning fidelity beyond mere accuracy, highlighting critical imbalances in current model reasoning patterns. Our comprehensive benchmarking of leading MLLMs uncovers significant limitations, underscoring the need for a balanced and holistic approach to both perception and reasoning. For more teasers and testing, visit our project page (https://verify-eqh.pages.dev/).