Where do Large Vision-Language Models Look at when Answering Questions?

Abstract

Large Vision-Language Models (LVLMs) have shown promising performance in vision-language understanding and reasoning tasks. However, their visual understanding behaviors remain underexplored. A fundamental question arises: to what extent do LVLMs rely on visual input, and which image regions contribute to their responses? Interpreting the free-form generation of LVLMs is non-trivial because of their complicated visual architectures (e.g., multiple encoders and multi-resolution) and variable-length outputs. In this paper, we extend existing heatmap visualization methods (e.g., iGOS++) to support LVLMs for open-ended visual question answering. We propose a method to select visually relevant tokens that reflect the relevance between the generated answers and the input image. Furthermore, we conduct a comprehensive analysis of state-of-the-art LVLMs on benchmarks designed to require visual information to answer. Our findings offer several insights into LVLM behavior, including the relationship between focus regions and answer correctness, differences in visual attention across architectures, and the impact of LLM scale on visual understanding. The code and data are available at https://github.com/bytedance/LVLM_Interpretation.
