

Where do Large Vision-Language Models Look at when Answering Questions?

March 18, 2025
Authors: Xiaoying Xing, Chia-Wen Kuo, Li Fuxin, Yulei Niu, Fan Chen, Ming Li, Ying Wu, Longyin Wen, Sijie Zhu
cs.AI

Abstract

Large Vision-Language Models (LVLMs) have shown promising performance in vision-language understanding and reasoning tasks. However, their visual understanding behaviors remain underexplored. A fundamental question arises: to what extent do LVLMs rely on visual input, and which image regions contribute to their responses? It is non-trivial to interpret the free-form generation of LVLMs due to their complicated visual architectures (e.g., multiple encoders and multi-resolution processing) and variable-length outputs. In this paper, we extend existing heatmap visualization methods (e.g., iGOS++) to support LVLMs for open-ended visual question answering. We propose a method to select visually relevant tokens that reflect the relevance between the generated answer and the input image. Furthermore, we conduct a comprehensive analysis of state-of-the-art LVLMs on benchmarks designed so that visual information is required to answer. Our findings offer several insights into LVLM behavior, including the relationship between focus regions and answer correctness, differences in visual attention across architectures, and the impact of LLM scale on visual understanding. The code and data are available at https://github.com/bytedance/LVLM_Interpretation.
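
The idea of selecting visually relevant answer tokens can be illustrated with a minimal sketch. This is not the paper's released implementation (see the GitHub repository for that); it assumes a simple proxy criterion: rank each generated answer token by how much its log-probability drops when the input image is removed or blanked out, and keep the tokens with the largest drop. The function name, tensor interface, and top-k cutoff are illustrative assumptions.

```python
import torch

def select_visually_relevant_tokens(logp_with_image: torch.Tensor,
                                     logp_without_image: torch.Tensor,
                                     top_k: int = 5) -> torch.Tensor:
    """Rank answer tokens by how much their log-probability drops
    when the image is blanked out (hypothetical proxy criterion).

    Both inputs have shape (num_answer_tokens,) and hold the
    log-probability the LVLM assigned to each generated answer token
    under the two conditions. Returns the indices of the top_k tokens
    whose probability depends most on the visual input.
    """
    # A large positive drop means the token's probability collapses
    # without the image, i.e. the token relies on visual evidence.
    drop = logp_with_image - logp_without_image
    k = min(top_k, drop.numel())
    return torch.topk(drop, k).indices

# Toy usage with made-up numbers: token 2 depends heavily on the image.
with_img = torch.tensor([-0.2, -1.0, -0.3, -2.0])
without_img = torch.tensor([-0.3, -1.1, -4.0, -2.1])
print(select_visually_relevant_tokens(with_img, without_img, top_k=2))
```

The selected token positions would then serve as the optimization target for a perturbation-based heatmap method such as iGOS++, which searches for the image regions whose removal most changes those tokens' probabilities.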