Where do Large Vision-Language Models Look at when Answering Questions?

Abstract

Large Vision-Language Models (LVLMs) have shown promising performance in vision-language understanding and reasoning tasks. However, their visual understanding behaviors remain underexplored. A fundamental question arises: to what extent do LVLMs rely on visual input, and which image regions contribute to their responses? Interpreting the free-form generation of LVLMs is non-trivial because of their complicated visual architectures (e.g., multiple encoders and multi-resolution inputs) and variable-length outputs. In this paper, we extend existing heatmap visualization methods (e.g., iGOS++) to support LVLMs for open-ended visual question answering. We propose a method to select visually relevant tokens that reflect the relevance between the generated answers and the input image. Furthermore, we conduct a comprehensive analysis of state-of-the-art LVLMs on benchmarks designed to require visual information to answer. Our findings offer several insights into LVLM behavior, including the relationship between focus region and answer correctness, differences in visual attention across architectures, and the impact of LLM scale on visual understanding. The code and data are available at https://github.com/bytedance/LVLM_Interpretation.
