MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
February 24, 2025
Authors: Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, Filip Ilievski
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have experienced rapid progress in
visual recognition tasks in recent years. Given their potential integration
into many critical applications, it is important to understand the limitations
of their visual perception. In this work, we study whether MLLMs can perceive
small visual details as effectively as large ones when answering questions
about images. We observe that their performance is very sensitive to the size
of the visual subject of the question, and further show that this effect is in
fact causal by conducting an intervention study. Next, we study the attention
patterns of MLLMs when answering visual questions, and intriguingly find that
they consistently know where to look, even when they provide the wrong answer.
Based on these findings, we then propose training-free visual intervention
methods that leverage the internal knowledge of any MLLM itself, in the form of
attention and gradient maps, to enhance its perception of small visual details.
We evaluate our proposed methods on two widely-used MLLMs and seven visual
question answering benchmarks and show that they can significantly improve
MLLMs' accuracy without requiring any training. Our results elucidate the risk
of applying MLLMs to visual recognition tasks concerning small details and
indicate that visual intervention using the model's internal state is a
promising direction to mitigate this risk.
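The intervention the abstract describes, using a model's own attention map to zoom in on the small visual subject before re-answering, can be illustrated with a minimal NumPy sketch. The function name, the crop fraction, and the toy attention map are all hypothetical; the paper's actual methods (and its gradient-map variant) are specified in the full text.

```python
import numpy as np

def attention_guided_crop(image, attn, crop_frac=0.5):
    """Crop `image` around the peak of a patch-level attention map.

    image: (H, W, C) pixel array; attn: (h, w) attention over image patches.
    Illustrative only -- the paper's actual intervention may differ.
    """
    H, W = image.shape[:2]
    h, w = attn.shape
    # Locate the patch receiving the most attention.
    py, px = np.unravel_index(np.argmax(attn), attn.shape)
    # Map the patch center back to pixel coordinates.
    cy = int((py + 0.5) * H / h)
    cx = int((px + 0.5) * W / w)
    # Crop a window of the requested size, clamped to the image bounds.
    ch, cw = int(H * crop_frac), int(W * crop_frac)
    y0 = min(max(cy - ch // 2, 0), H - ch)
    x0 = min(max(cx - cw // 2, 0), W - cw)
    return image[y0:y0 + ch, x0:x0 + cw]

# Toy example: a small bright "detail" whose patch (6, 2) dominates attention.
img = np.zeros((64, 64, 3))
img[48:56, 16:24] = 1.0
attn = np.zeros((8, 8))
attn[6, 2] = 1.0
crop = attention_guided_crop(img, attn)
print(crop.shape)  # (32, 32, 3); the crop contains the attended detail
```

The cropped region would then be fed back to the MLLM (alone or alongside the original image) so the small subject occupies a larger fraction of the visual input, which is the training-free zooming idea the abstract motivates.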