Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning
September 8, 2025
Authors: Yuyao Ge, Shenghua Liu, Yiwei Wang, Lingrui Mei, Baolong Bi, Xuanshan Zhou, Jiayu Yao, Jiafeng Guo, Xueqi Cheng
cs.AI
Abstract
Vision-Language Models (VLMs) have demonstrated remarkable success across
diverse visual tasks, yet their performance degrades in complex visual
environments. Existing enhancement approaches require additional training,
rely on external segmentation tools, or operate only at coarse-grained levels,
overlooking the innate abilities of VLMs themselves. To bridge this gap, we
investigate VLMs' attention patterns and discover that: (1) visual complexity
strongly correlates with attention entropy, negatively impacting reasoning
performance; (2) attention progressively refines from global scanning in
shallow layers to focused convergence in deeper layers, with the degree of
convergence determined by visual complexity; and (3) theoretically, we prove
that contrasting the attention maps of a general query and a task-specific
query decomposes the visual signal into semantic and visual-noise
components. Building on these insights, we propose Contrastive Attention
Refinement for Visual Enhancement (CARVE), a training-free method that extracts
task-relevant visual signals through attention contrasting at the pixel level.
Extensive experiments demonstrate that CARVE consistently enhances performance,
achieving up to 75% improvement on open-source models. Our work provides
critical insights into the interplay between visual complexity and attention
mechanisms, offering an efficient pathway for improving visual reasoning with
contrastive attention.
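
The abstract's core operation can be illustrated with a minimal sketch: contrast the attention map induced by a task-specific query against the map induced by a general query, keeping only the task-specific excess as the semantic signal. Everything below is an illustrative assumption, not the authors' CARVE implementation; names such as attn_task, attn_general, and the normalization choices are hypothetical.

```python
# Illustrative sketch (not the authors' code): attention entropy and a
# general-vs-task attention contrast over a grid of image patches.
import numpy as np

def attention_entropy(attn: np.ndarray) -> float:
    """Shannon entropy of an attention distribution over image patches.

    Higher entropy = attention spread across many patches (complex scenes);
    lower entropy = focused convergence on a few patches.
    """
    p = attn / attn.sum()
    p = np.clip(p, 1e-12, None)  # avoid log(0)
    return float(-(p * np.log(p)).sum())

def contrastive_attention(attn_task: np.ndarray,
                          attn_general: np.ndarray) -> np.ndarray:
    """Contrast a task-specific attention map with a general-query map.

    Both inputs are non-negative maps over the same patch grid (e.g.
    text-to-image attention averaged over heads/layers). Subtracting the
    general map suppresses complexity-driven visual noise attended to by
    both queries, keeping the signal specific to the task.
    """
    task = attn_task / attn_task.sum()
    general = attn_general / attn_general.sum()
    signal = np.clip(task - general, 0.0, None)  # keep task-specific excess
    if signal.sum() > 0:
        signal /= signal.sum()                   # renormalize to a distribution
    return signal

# Toy example on a 4x4 patch grid: the contrast collapses diffuse attention
# onto the single patch that only the task query emphasizes.
rng = np.random.default_rng(0)
noise = rng.random((4, 4))
attn_general = noise.copy()
attn_task = noise.copy()
attn_task[1, 2] += 2.0
mask = contrastive_attention(attn_task, attn_general)
print("entropy before contrast:", attention_entropy(attn_task))
print("entropy after contrast :", attention_entropy(mask))
```

In the paper this kind of contrast is applied at the pixel level to enhance the visual input; the sketch above only shows the map-level operation under the stated assumptions.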