Where do Large Vision-Language Models Look at when Answering Questions?
March 18, 2025
Authors: Xiaoying Xing, Chia-Wen Kuo, Li Fuxin, Yulei Niu, Fan Chen, Ming Li, Ying Wu, Longyin Wen, Sijie Zhu
cs.AI
Abstract
Large Vision-Language Models (LVLMs) have shown promising performance in
vision-language understanding and reasoning tasks. However, their visual
understanding behaviors remain underexplored. A fundamental question arises: to
what extent do LVLMs rely on visual input, and which image regions contribute
to their responses? It is non-trivial to interpret the free-form generation of
LVLMs due to their complicated visual architectures (e.g., multiple encoders and
multi-resolution processing) and variable-length outputs. In this paper, we extend
existing heatmap visualization methods (e.g., iGOS++) to support LVLMs for
open-ended visual question answering. We propose a method to select visually
relevant tokens that reflect the relevance between the generated answer and the
input image. Furthermore, we conduct a comprehensive analysis of state-of-the-art
LVLMs on benchmarks designed to require visual information to answer. Our
findings offer several insights into LVLM behavior, including the relationship
between focus region and answer correctness, differences in visual attention
across architectures, and the impact of LLM scale on visual understanding. The
code and data are available at
https://github.com/bytedance/LVLM_Interpretation.
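To make the token-selection idea concrete, below is a minimal sketch of one plausible way to score how much each generated token relies on the visual input: compare the log-probability of each answer token under the original image against a perturbed baseline (e.g., a heavily blurred image). This is an illustrative proxy, not necessarily the paper's exact criterion; the `model(input_ids=..., pixel_values=...)` interface follows HuggingFace-style LVLM conventions, and the exact argument names vary by model.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def visual_relevance_scores(model, input_ids, answer_ids, image, baseline_image):
    """Per-token log-prob drop when the image is perturbed (illustrative sketch).

    input_ids:      (1, T_prompt) prompt/question token ids
    answer_ids:     (1, T_ans)    token ids of the generated answer
    image:          original image tensor (preprocessed pixel values)
    baseline_image: perturbed image, same shape (e.g., heavily blurred)
    """
    def answer_logprobs(pixel_values):
        # Teacher-force prompt + answer and read off the log-probability
        # the model assigns to each answer token.
        ids = torch.cat([input_ids, answer_ids], dim=1)
        logits = model(input_ids=ids, pixel_values=pixel_values).logits  # (1, T, V)
        # The logit at position t predicts token t+1, so answer tokens are
        # predicted by positions T_prompt-1 .. T-2.
        ans_logits = logits[0, input_ids.shape[1] - 1 : -1]              # (T_ans, V)
        logp = F.log_softmax(ans_logits, dim=-1)
        return logp.gather(1, answer_ids[0, :, None]).squeeze(1)         # (T_ans,)

    # A large drop means the token depends heavily on the visual input.
    return answer_logprobs(image) - answer_logprobs(baseline_image)
```

Tokens whose score exceeds a threshold could then be treated as visually relevant, and an iGOS++-style optimization would produce the heatmap showing which image regions support them.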