LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning
January 15, 2026
Authors: Linquan Wu, Tianxiang Jiang, Yifei Dong, Haoyu Yang, Fengji Zhang, Shichaang Meng, Ai Xuan, Linqi Song, Jacky Keung
cs.AI
Abstract
Current multimodal latent reasoning often relies on external supervision (e.g., auxiliary images), ignoring intrinsic visual attention dynamics. In this work, we identify a critical Perception Gap in distillation: student models frequently mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception. To bridge this, we propose LaViT, a framework that aligns latent visual thoughts rather than static embeddings. LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning. Extensive experiments show that LaViT significantly enhances visual grounding, achieving up to +16.9% gains on complex reasoning tasks and enabling a compact 3B model to outperform larger open-source variants and proprietary models like GPT-4o.
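The abstract describes the training objective only at a high level: align the student's latent visual thoughts and attention trajectories with the teacher's before any answer text is generated, under a curriculum gate that blocks shortcut learning. The sketch below is a minimal, hypothetical PyTorch-style illustration of such an objective; the function name `lavit_alignment_loss`, the tensor shapes, the cosine/KL choice of distances, and the linear gate schedule are assumptions for illustration, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def lavit_alignment_loss(student_latents, teacher_latents,
                         student_attn, teacher_attn,
                         text_loss, step, total_steps):
    """Illustrative distillation objective in the spirit of LaViT (assumed form).

    student_latents / teacher_latents: (B, T, D) latent visual thoughts produced
        autoregressively before any answer token is emitted.
    student_attn / teacher_attn: (B, T, P) attention distributions over image
        patches at each latent step (the "attention trajectory").
    text_loss: standard next-token cross-entropy on the answer text.
    step / total_steps: training progress, used for the curriculum gate.
    """
    # Align visual semantics: cosine distance between latent thoughts.
    sem_loss = 1.0 - F.cosine_similarity(
        student_latents, teacher_latents, dim=-1).mean()

    # Align attention trajectories: KL divergence per latent step
    # (input to kl_div must be log-probabilities, target plain probabilities).
    attn_loss = F.kl_div(student_attn.clamp_min(1e-8).log(),
                         teacher_attn, reduction="batchmean")

    # Curriculum gate (one possible schedule): early in training the perceptual
    # terms dominate, discouraging the shortcut of copying the teacher's text
    # while ignoring the image; the gate relaxes as training progresses.
    gate = max(0.0, 1.0 - step / total_steps)

    return gate * (sem_loss + attn_loss) + (1.0 - gate) * text_loss
```

As a usage note, the perceptual terms here act as a proxy for closing the Perception Gap the abstract identifies: the student is rewarded for looking where the teacher looked, not merely for reproducing the teacher's words.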