

From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning

March 4, 2026
作者: Ruilin Luo, Chufan Shi, Yizhen Zhang, Cheng Yang, Songtao Jiang, Tongkun Guan, Ruizhe Chen, Ruihang Chu, Peng Wang, Mingkun Yang, Yujiu Yang, Junyang Lin, Zhibo Yang
cs.AI

Abstract
The cold-start initialization stage plays a pivotal role in training Multimodal Large Reasoning Models (MLRMs), yet its mechanisms remain insufficiently understood. To analyze this stage, we introduce the Visual Attention Score (VAS), an attention-based metric that quantifies how much a model attends to visual tokens. We find that reasoning performance is strongly correlated with VAS (r=0.9616): models with higher VAS achieve substantially stronger multimodal reasoning. Surprisingly, multimodal cold-start fails to elevate VAS, resulting in attention distributions close to the base model, whereas text-only cold-start leads to a clear increase. We term this counter-intuitive phenomenon Lazy Attention Localization. To validate its causal role, we design training-free interventions that directly modulate attention allocation during inference, yielding performance gains of 1-2% without any retraining. Building on these insights, we further propose Attention-Guided Visual Anchoring and Reflection (AVAR), a comprehensive cold-start framework that integrates visual-anchored data synthesis, attention-guided objectives, and visual-anchored reward shaping. Applied to Qwen2.5-VL-7B, AVAR achieves an average gain of 7.0% across 7 multimodal reasoning benchmarks. Ablation studies further confirm that each component of AVAR contributes step-wise to the overall gains. The code, data, and models are available at https://github.com/lrlbbzl/Qwen-AVAR.
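The abstract describes VAS as the share of attention a model places on visual tokens. A minimal sketch of such a metric is shown below; the paper's exact formulation (which layers, heads, and query positions are aggregated, and how scores are normalized) is not specified in the abstract, so the uniform averaging here is an illustrative assumption, not the authors' definition.

```python
import numpy as np

def visual_attention_score(attn: np.ndarray, visual_mask: np.ndarray) -> float:
    """Illustrative VAS sketch (assumed aggregation, not the paper's exact metric).

    attn:        array of shape (layers, heads, query_len, key_len) holding
                 attention weights, where each query row sums to 1.
    visual_mask: boolean array of shape (key_len,), True at visual-token positions.

    Returns the attention mass assigned to visual tokens, averaged uniformly
    over all layers, heads, and query positions.
    """
    # Sum attention weights over the visual key positions for every query row.
    mass_on_visual = attn[..., visual_mask].sum(axis=-1)  # (layers, heads, query_len)
    # Average the per-row visual mass across layers, heads, and queries.
    return float(mass_on_visual.mean())
```

In practice, attention tensors of this shape can be obtained from most transformer implementations that expose per-layer attention weights at inference time; a higher score under this sketch simply means more of the softmax mass lands on image-token positions.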