From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning

Abstract

The cold-start initialization stage plays a pivotal role in training Multimodal Large Reasoning Models (MLRMs), yet its mechanisms remain insufficiently understood. To analyze this stage, we introduce the Visual Attention Score (VAS), an attention-based metric that quantifies how much a model attends to visual tokens. We find that reasoning performance is strongly correlated with VAS (r=0.9616): models with higher VAS achieve substantially stronger multimodal reasoning. Surprisingly, multimodal cold-start fails to elevate VAS, resulting in attention distributions close to those of the base model, whereas text-only cold-start leads to a clear increase. We term this counter-intuitive phenomenon Lazy Attention Localization. To validate its causal role, we design training-free interventions that directly modulate attention allocation during inference, yielding performance gains of 1-2% without any retraining. Building on these insights, we further propose Attention-Guided Visual Anchoring and Reflection (AVAR), a comprehensive cold-start framework that integrates visual-anchored data synthesis, attention-guided objectives, and visual-anchored reward shaping. Applied to Qwen2.5-VL-7B, AVAR achieves an average gain of 7.0% across 7 multimodal reasoning benchmarks. Ablation studies further confirm that each component of AVAR contributes step-wise to the overall gains. The code, data, and models are available at https://github.com/lrlbbzl/Qwen-AVAR.
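
The abstract does not state the formula for VAS or the form of the training-free intervention, so the following is only an illustrative sketch under stated assumptions, not the authors' method. It assumes an attention-based score can be read as the share of one query's attention mass that falls on visual tokens, and that an inference-time intervention can be approximated by rescaling the visual-token weights and renormalizing. The names `visual_attention_share`, `boost_visual_attention`, and `factor` are hypothetical.

```python
from typing import List, Sequence


def visual_attention_share(weights: Sequence[float], is_visual: Sequence[bool]) -> float:
    """Fraction of one query's attention mass that lands on visual tokens.

    Assumption: this ratio stands in for an attention-based score; the
    paper's actual VAS definition may differ.
    """
    if len(weights) != len(is_visual):
        raise ValueError("weights and is_visual must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    visual = sum(w for w, v in zip(weights, is_visual) if v)
    return visual / total


def boost_visual_attention(
    weights: Sequence[float], is_visual: Sequence[bool], factor: float
) -> List[float]:
    """Scale visual-token weights by `factor`, then renormalize to sum to 1.

    Assumption: a hypothetical stand-in for a training-free intervention
    that shifts attention toward visual tokens at inference time.
    """
    if len(weights) != len(is_visual):
        raise ValueError("weights and is_visual must have the same length")
    if factor <= 0:
        raise ValueError("factor must be positive")
    scaled = [w * factor if v else w for w, v in zip(weights, is_visual)]
    total = sum(scaled)
    if total <= 0:
        raise ValueError("scaled weights must sum to a positive value")
    return [w / total for w in scaled]


# Toy example: the first two positions are visual tokens, the rest are text.
weights = [0.1, 0.2, 0.3, 0.4]
is_visual = [True, True, False, False]

share_before = visual_attention_share(weights, is_visual)
boosted = boost_visual_attention(weights, is_visual, factor=2.0)
share_after = visual_attention_share(boosted, is_visual)
print(f"visual share before: {share_before:.3f}, after: {share_after:.3f}")
```

In this toy setting, a factor above 1 raises the visual share while keeping the weights a valid distribution. In a real model the weights would come from the attention maps of specific layers and heads, and which of those to aggregate is a design choice the abstract leaves open.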
