

Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning

September 8, 2025
Authors: Yuyao Ge, Shenghua Liu, Yiwei Wang, Lingrui Mei, Baolong Bi, Xuanshan Zhou, Jiayu Yao, Jiafeng Guo, Xueqi Cheng
cs.AI

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable success across diverse visual tasks, yet their performance degrades in complex visual environments. Existing enhancement approaches require additional training, rely on external segmentation tools, or operate at coarse-grained levels, overlooking the abilities innate to VLMs themselves. To bridge this gap, we investigate VLMs' attention patterns and discover that: (1) visual complexity correlates strongly with attention entropy, which negatively impacts reasoning performance; (2) attention progressively refines from global scanning in shallow layers to focused convergence in deeper layers, with the degree of convergence determined by visual complexity; and (3) theoretically, we prove that contrasting the attention maps of a general query and a task-specific query decomposes the visual signal into semantic signal and visual noise components. Building on these insights, we propose Contrastive Attention Refinement for Visual Enhancement (CARVE), a training-free method that extracts task-relevant visual signals through pixel-level attention contrasting. Extensive experiments demonstrate that CARVE consistently enhances performance, achieving up to a 75% improvement on open-source models. Our work provides critical insights into the interplay between visual complexity and attention mechanisms, offering an efficient pathway for improving visual reasoning through contrastive attention.
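
To make the abstract's two central quantities concrete, the sketch below illustrates (a) the attention-entropy measure that the paper correlates with visual complexity, and (b) the idea of contrasting a task-specific attention map against a general-query map to isolate a task-relevant signal. This is a minimal illustration under assumed conventions: the NumPy representation (per-patch attention weights), the function names, the log-space contrast, and the normalization are hypothetical choices, not the authors' CARVE implementation.

```python
# Illustrative sketch only -- shapes, names, and the exact contrast operation
# are assumptions; the paper's CARVE method may differ in detail.
import numpy as np

def attention_entropy(attn):
    """Shannon entropy of an attention distribution over image patches.

    attn: 1-D array of non-negative weights over N image patches.
    Higher entropy = more dispersed attention, which the abstract links
    to higher visual complexity and weaker reasoning.
    """
    p = attn / attn.sum()
    return -np.sum(p * np.log(p + 1e-12))

def contrastive_attention_map(attn_task, attn_general, eps=1e-12):
    """Contrast a task-specific attention map against a general-query map.

    Assumption: the general query's map captures query-agnostic visual
    noise, so a log-space difference leaves the task-relevant semantic
    signal. The result is clipped and renormalized into a pixel/patch mask.
    """
    contrast = np.log(attn_task + eps) - np.log(attn_general + eps)
    contrast = np.clip(contrast, 0.0, None)  # keep only task-enhanced regions
    total = contrast.sum()
    return contrast / total if total > 0 else contrast

# Toy usage: 16 image patches, attention from two prompts over one image.
rng = np.random.default_rng(0)
attn_general = rng.dirichlet(np.ones(16))  # e.g. "Describe the image."
attn_task = rng.dirichlet(np.ones(16))     # e.g. "What does the sign say?"
print("entropy (general query):", attention_entropy(attn_general))
mask = contrastive_attention_map(attn_task, attn_general)
print("task-relevant mask sum:", mask.sum())
```

In CARVE's terms, such a mask would reweight the visual input before reasoning; the method's actual normalization, layer selection, and pixel-level mapping are specified in the paper rather than here.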