Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable success across diverse visual tasks, yet their performance degrades in complex visual environments. Existing enhancement approaches require additional training, rely on external segmentation tools, or operate at coarse-grained levels, and in doing so they overlook the innate ability of VLMs. To bridge this gap, we investigate VLMs' attention patterns and report three findings. (1) Visual complexity strongly correlates with attention entropy, negatively impacting reasoning performance. (2) Attention progressively refines from global scanning in shallow layers to focused convergence in deeper layers, with the degree of convergence determined by visual complexity. (3) Theoretically, we prove that the contrast between the attention maps of general queries and task-specific queries enables the decomposition of the visual signal into semantic signal and visual noise components. Building on these insights, we propose Contrastive Attention Refinement for Visual Enhancement (CARVE), a training-free method that extracts task-relevant visual signals through attention contrasting at the pixel level. Extensive experiments demonstrate that CARVE consistently enhances performance, achieving up to 75% improvement on open-source models. Our work provides critical insights into the interplay between visual complexity and attention mechanisms, offering an efficient pathway for improving visual reasoning with contrastive attention.
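The abstract names two quantities, attention entropy and a contrast between general-query and task-specific attention maps, without giving formulas. The following is a minimal illustrative sketch, not the CARVE implementation: it assumes entropy is the Shannon entropy of a normalized attention map, and it assumes the contrast can be approximated by a smoothed ratio of the two maps. The function names `attention_entropy` and `contrast_attention`, the `eps` smoothing term, and the toy 4x4 maps are all hypothetical.

```python
import numpy as np


def attention_entropy(attn):
    """Shannon entropy (in nats) of an attention map normalized to sum to 1."""
    p = np.asarray(attn, dtype=np.float64).ravel()
    p = p / p.sum()
    nz = p[p > 0]  # skip zero entries, since 0 * log(0) is taken as 0
    return float(-(nz * np.log(nz)).sum())


def contrast_attention(task_attn, general_attn, eps=1e-6):
    """Assumed contrast: ratio of task-specific to general attention, renormalized.

    This ratio form is an illustrative choice, not taken from the paper.
    """
    task = np.asarray(task_attn, dtype=np.float64)
    general = np.asarray(general_attn, dtype=np.float64)
    ratio = task / (general + eps)
    return ratio / ratio.sum()


# Toy 4x4 maps: both share the same background pattern ("visual noise"),
# and the task-specific map has extra mass at position (1, 2).
rng = np.random.default_rng(0)
noise = rng.random((4, 4))
general = noise / noise.sum()
task = noise.copy()
task[1, 2] += 2.0
task = task / task.sum()

refined = contrast_attention(task, general)
peak = np.unravel_index(refined.argmax(), refined.shape)
print("peak of contrasted map:", peak)
print("entropy of task map:", attention_entropy(task))
print("entropy of contrasted map:", attention_entropy(refined))
```

In this toy setup the shared background largely cancels in the ratio, so the contrasted map peaks where the task-specific map differs from the general one. How the paper actually defines and applies the contrast should be taken from the full text.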
