MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
February 24, 2025
Authors: Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, Filip Ilievski
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have experienced rapid progress in
visual recognition tasks in recent years. Given their potential integration
into many critical applications, it is important to understand the limitations
of their visual perception. In this work, we study whether MLLMs can perceive
small visual details as effectively as large ones when answering questions
about images. We observe that their performance is very sensitive to the size
of the visual subject of the question, and further show that this effect is in
fact causal by conducting an intervention study. Next, we study the attention
patterns of MLLMs when answering visual questions, and intriguingly find that
they consistently know where to look, even when they provide the wrong answer.
Based on these findings, we then propose training-free visual intervention
methods that leverage the internal knowledge of any MLLM itself, in the form of
attention and gradient maps, to enhance its perception of small visual details.
We evaluate our proposed methods on two widely-used MLLMs and seven visual
question answering benchmarks and show that they can significantly improve
MLLMs' accuracy without requiring any training. Our results elucidate the risk
of applying MLLMs to visual recognition tasks concerning small details and
indicate that visual intervention using the model's internal state is a
promising direction to mitigate this risk.
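The intervention the abstract describes, using a model's own attention map to zoom in on the small visual subject before re-answering, can be illustrated with a minimal NumPy sketch. The function name, the crop fraction, and the toy attention map are all hypothetical; the paper's actual methods (and its gradient-map variant) are specified in the full text.

```python
import numpy as np

def attention_guided_crop(image, attn, crop_frac=0.5):
    """Crop `image` around the peak of a patch-level attention map.

    image: (H, W, C) pixel array; attn: (h, w) attention over image patches.
    Illustrative only -- the paper's actual intervention may differ.
    """
    H, W = image.shape[:2]
    h, w = attn.shape
    # Locate the patch receiving the most attention.
    py, px = np.unravel_index(np.argmax(attn), attn.shape)
    # Map the patch center back to pixel coordinates.
    cy = int((py + 0.5) * H / h)
    cx = int((px + 0.5) * W / w)
    # Crop a window of the requested size, clamped to the image bounds.
    ch, cw = int(H * crop_frac), int(W * crop_frac)
    y0 = min(max(cy - ch // 2, 0), H - ch)
    x0 = min(max(cx - cw // 2, 0), W - cw)
    return image[y0:y0 + ch, x0:x0 + cw]

# Toy example: a small bright "detail" whose patch (6, 2) dominates attention.
img = np.zeros((64, 64, 3))
img[48:56, 16:24] = 1.0
attn = np.zeros((8, 8))
attn[6, 2] = 1.0
crop = attention_guided_crop(img, attn)
print(crop.shape)  # (32, 32, 3); the crop contains the attended detail
```

The cropped region would then be fed back to the MLLM (alone or alongside the original image) so the small subject occupies a larger fraction of the visual input, which is the training-free zooming idea the abstract motivates.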