Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning
September 8, 2025
Authors: Yuyao Ge, Shenghua Liu, Yiwei Wang, Lingrui Mei, Baolong Bi, Xuanshan Zhou, Jiayu Yao, Jiafeng Guo, Xueqi Cheng
cs.AI
Abstract
Vision-Language Models (VLMs) have demonstrated remarkable success across
diverse visual tasks, yet their performance degrades in complex visual
environments. Existing enhancement approaches require additional training,
rely on external segmentation tools, or operate only at coarse-grained levels,
overlooking the innate abilities of VLMs themselves. To bridge this gap, we
investigate VLMs' attention patterns and discover that: (1) visual complexity
strongly correlates with attention entropy, negatively impacting reasoning
performance; (2) attention progressively refines from global scanning in
shallow layers to focused convergence in deeper layers, with the degree of
convergence determined by visual complexity; and (3) theoretically, we prove
that contrasting the attention maps of a general query and a task-specific
query decomposes the visual signal into semantic and visual-noise
components. Building on these insights, we propose Contrastive Attention
Refinement for Visual Enhancement (CARVE), a training-free method that extracts
task-relevant visual signals through attention contrasting at the pixel level.
Extensive experiments demonstrate that CARVE consistently enhances performance,
achieving up to 75% improvement on open-source models. Our work provides
critical insights into the interplay between visual complexity and attention
mechanisms, offering an efficient pathway for improving visual reasoning with
contrastive attention.
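
The abstract's core operation can be illustrated with a minimal sketch: contrast the attention map induced by a task-specific query against the map induced by a general query, keeping only the task-specific excess as the semantic signal. Everything below is an illustrative assumption, not the authors' CARVE implementation; names such as attn_task, attn_general, and the normalization choices are hypothetical.

```python
# Illustrative sketch (not the authors' code): attention entropy and a
# general-vs-task attention contrast over a grid of image patches.
import numpy as np

def attention_entropy(attn: np.ndarray) -> float:
    """Shannon entropy of an attention distribution over image patches.

    Higher entropy = attention spread across many patches (complex scenes);
    lower entropy = focused convergence on a few patches.
    """
    p = attn / attn.sum()
    p = np.clip(p, 1e-12, None)  # avoid log(0)
    return float(-(p * np.log(p)).sum())

def contrastive_attention(attn_task: np.ndarray,
                          attn_general: np.ndarray) -> np.ndarray:
    """Contrast a task-specific attention map with a general-query map.

    Both inputs are non-negative maps over the same patch grid (e.g.
    text-to-image attention averaged over heads/layers). Subtracting the
    general map suppresses complexity-driven visual noise attended to by
    both queries, keeping the signal specific to the task.
    """
    task = attn_task / attn_task.sum()
    general = attn_general / attn_general.sum()
    signal = np.clip(task - general, 0.0, None)  # keep task-specific excess
    if signal.sum() > 0:
        signal /= signal.sum()                   # renormalize to a distribution
    return signal

# Toy example on a 4x4 patch grid: the contrast collapses diffuse attention
# onto the single patch that only the task query emphasizes.
rng = np.random.default_rng(0)
noise = rng.random((4, 4))
attn_general = noise.copy()
attn_task = noise.copy()
attn_task[1, 2] += 2.0
mask = contrastive_attention(attn_task, attn_general)
print("entropy before contrast:", attention_entropy(attn_task))
print("entropy after contrast :", attention_entropy(mask))
```

In the paper this kind of contrast is applied at the pixel level to enhance the visual input; the sketch above only shows the map-level operation under the stated assumptions.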