

MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs

November 18, 2025
Authors: Huiyi Chen, Jiawei Peng, Dehai Min, Changchang Sun, Kaijie Chen, Yan Yan, Xu Yang, Lu Cheng
cs.AI

Abstract

Evaluating the robustness of Large Vision-Language Models (LVLMs) is essential for their continued development and responsible deployment in real-world applications. However, existing robustness benchmarks typically focus on hallucination or misleading textual inputs, while largely overlooking the equally critical challenge posed by misleading visual inputs in assessing visual understanding. To fill this important gap, we introduce MVI-Bench, the first comprehensive benchmark specially designed for evaluating how Misleading Visual Inputs undermine the robustness of LVLMs. Grounded in fundamental visual primitives, the design of MVI-Bench centers on three hierarchical levels of misleading visual inputs: Visual Concept, Visual Attribute, and Visual Relationship. Using this taxonomy, we curate six representative categories and compile 1,248 expertly annotated VQA instances. To facilitate fine-grained robustness evaluation, we further introduce MVI-Sensitivity, a novel metric that characterizes LVLM robustness at a granular level. Empirical results across 18 state-of-the-art LVLMs uncover pronounced vulnerabilities to misleading visual inputs, and our in-depth analyses on MVI-Bench provide actionable insights that can guide the development of more reliable and robust LVLMs. The benchmark and codebase can be accessed at https://github.com/chenyil6/MVI-Bench.
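The abstract does not spell out how MVI-Sensitivity is computed; the paper and repository are the authoritative sources. Purely as an illustration of the evaluation idea, the sketch below assumes the metric behaves like a per-category relative accuracy drop when a misleading visual input replaces the original one. All names here (`MVIInstance`, `mvi_sensitivity`, the field names) are hypothetical and not taken from the MVI-Bench codebase.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class MVIInstance:
    category: str             # one of the six MVI-Bench categories (hypothetical field)
    correct_original: bool    # model answered correctly on the original visual input
    correct_misleading: bool  # model answered correctly on the misleading visual input


def mvi_sensitivity(instances):
    """Hypothetical per-category sensitivity: relative accuracy drop when the
    misleading visual input is substituted for the original one."""
    by_category = defaultdict(list)
    for inst in instances:
        by_category[inst.category].append(inst)

    scores = {}
    for category, items in by_category.items():
        acc_original = sum(i.correct_original for i in items) / len(items)
        acc_misleading = sum(i.correct_misleading for i in items) / len(items)
        # Guard against division by zero when the model fails even on originals.
        scores[category] = (
            (acc_original - acc_misleading) / acc_original if acc_original > 0 else 0.0
        )
    return scores


# Toy usage with mock per-instance results for two of the taxonomy levels.
results = [
    MVIInstance("Visual Concept", True, False),
    MVIInstance("Visual Concept", True, True),
    MVIInstance("Visual Relationship", True, False),
]
print(mvi_sensitivity(results))  # e.g. {'Visual Concept': 0.5, 'Visual Relationship': 1.0}
```

A score near 0 under this reading would mean the model is robust to that category of misleading input, while a score near 1 would mean its correct answers are almost entirely undone; consult the paper for the actual definition.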