

From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning

March 4, 2026
作者: Ruilin Luo, Chufan Shi, Yizhen Zhang, Cheng Yang, Songtao Jiang, Tongkun Guan, Ruizhe Chen, Ruihang Chu, Peng Wang, Mingkun Yang, Yujiu Yang, Junyang Lin, Zhibo Yang
cs.AI

Abstract
The cold-start initialization stage plays a pivotal role in training Multimodal Large Reasoning Models (MLRMs), yet its mechanisms remain insufficiently understood. To analyze this stage, we introduce the Visual Attention Score (VAS), an attention-based metric that quantifies how much a model attends to visual tokens. We find that reasoning performance is strongly correlated with VAS (r=0.9616): models with higher VAS achieve substantially stronger multimodal reasoning. Surprisingly, multimodal cold-start fails to elevate VAS, resulting in attention distributions close to the base model, whereas text-only cold-start leads to a clear increase. We term this counter-intuitive phenomenon Lazy Attention Localization. To validate its causal role, we design training-free interventions that directly modulate attention allocation during inference, yielding performance gains of 1-2% without any retraining. Building on these insights, we further propose Attention-Guided Visual Anchoring and Reflection (AVAR), a comprehensive cold-start framework that integrates visual-anchored data synthesis, attention-guided objectives, and visual-anchored reward shaping. Applied to Qwen2.5-VL-7B, AVAR achieves an average gain of 7.0% across 7 multimodal reasoning benchmarks. Ablation studies further confirm that each component of AVAR contributes step-wise to the overall gains. The code, data, and models are available at https://github.com/lrlbbzl/Qwen-AVAR.
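The abstract describes VAS as the share of attention a model places on visual tokens. A minimal sketch of such a metric is shown below; the paper's exact formulation (which layers, heads, and query positions are aggregated, and how scores are normalized) is not specified in the abstract, so the uniform averaging here is an illustrative assumption, not the authors' definition.

```python
import numpy as np

def visual_attention_score(attn: np.ndarray, visual_mask: np.ndarray) -> float:
    """Illustrative VAS sketch (assumed aggregation, not the paper's exact metric).

    attn:        array of shape (layers, heads, query_len, key_len) holding
                 attention weights, where each query row sums to 1.
    visual_mask: boolean array of shape (key_len,), True at visual-token positions.

    Returns the attention mass assigned to visual tokens, averaged uniformly
    over all layers, heads, and query positions.
    """
    # Sum attention weights over the visual key positions for every query row.
    mass_on_visual = attn[..., visual_mask].sum(axis=-1)  # (layers, heads, query_len)
    # Average the per-row visual mass across layers, heads, and queries.
    return float(mass_on_visual.mean())
```

In practice, attention tensors of this shape can be obtained from most transformer implementations that expose per-layer attention weights at inference time; a higher score under this sketch simply means more of the softmax mass lands on image-token positions.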