VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity
March 14, 2025
Authors: Jing Bi, Junjia Guo, Susan Liang, Guangyu Sun, Luchuan Song, Yunlong Tang, Jinxi He, Jiarui Wu, Ali Vosoughi, Chen Chen, Chenliang Xu
cs.AI
Abstract
Visual reasoning is central to human cognition, enabling individuals to
interpret and abstractly understand their environment. Although recent
Multimodal Large Language Models (MLLMs) have demonstrated impressive
performance across language and vision-language tasks, existing benchmarks
primarily measure recognition-based skills and inadequately assess true visual
reasoning capabilities. To bridge this critical gap, we introduce VERIFY, a
benchmark explicitly designed to isolate and rigorously evaluate the visual
reasoning capabilities of state-of-the-art MLLMs. VERIFY compels models to
reason primarily from visual information, providing minimal textual context to
reduce reliance on domain-specific knowledge and linguistic biases. Each
problem is accompanied by a human-annotated reasoning path, making it the first
to provide in-depth evaluation of model decision-making processes.
Additionally, we propose novel metrics that assess visual reasoning fidelity
beyond mere accuracy, highlighting critical imbalances in current model
reasoning patterns. Our comprehensive benchmarking of leading MLLMs uncovers
significant limitations, underscoring the need for a balanced and holistic
approach to both perception and reasoning. For more teasers and testing, visit
our project page (https://verify-eqh.pages.dev/).