MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs

Abstract

Multimodal Large Language Models (MLLMs) have made rapid progress in visual recognition tasks in recent years. Given their potential integration into many critical applications, it is important to understand the limitations of their visual perception. In this work, we study whether MLLMs can perceive small visual details as effectively as large ones when answering questions about images. We observe that their performance is very sensitive to the size of the visual subject of the question, and we show through an intervention study that this effect is in fact causal. Next, we study the attention patterns of MLLMs when answering visual questions and find, intriguingly, that they consistently know where to look, even when they provide the wrong answer. Based on these findings, we propose training-free visual intervention methods that leverage the internal knowledge of any MLLM itself, in the form of attention and gradient maps, to enhance its perception of small visual details. We evaluate the proposed methods on two widely used MLLMs and seven visual question answering benchmarks and show that they can significantly improve MLLMs' accuracy without requiring any training. Our results elucidate the risk of applying MLLMs to visual recognition tasks concerning small details and indicate that visual intervention using the model's internal state is a promising direction for mitigating this risk.
