Latent Implicit Visual Reasoning
December 24, 2025
Authors: Kelvin Li, Chuyi Shang, Leonid Karlinsky, Rogerio Feris, Trevor Darrell, Roei Herzig
cs.AI
Abstract
While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are predominantly visual. Recent approaches have sought to address this by supervising intermediate visual steps with helper images, depth maps, or image crops. However, these strategies impose restrictive priors on what "useful" visual abstractions look like, add heavy annotation costs, and struggle to generalize across tasks. To address this critical limitation, we propose a task-agnostic mechanism that trains LMMs to discover and use visual reasoning tokens without explicit supervision. These tokens attend globally and re-encode the image in a task-adaptive way, enabling the model to extract relevant visual information without hand-crafted supervision. Our approach outperforms direct fine-tuning and achieves state-of-the-art results on a diverse range of vision-centric tasks -- including those where intermediate abstractions are hard to specify -- while also generalizing to multi-task instruction tuning.
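To make the mechanism concrete, the sketch below shows one way such "visual reasoning tokens" could be wired up: a small set of learnable latent tokens attends globally over the image patch embeddings and is fed back to the LMM as a task-adaptive re-encoding of the image. This is an illustrative assumption based only on the abstract, not the authors' implementation; the class name LatentVisualReasoner, the token count, and all hyperparameters are hypothetical.

```python
# Minimal sketch (not the paper's code): learnable latent tokens that attend
# globally over image patch embeddings, yielding a task-adaptive re-encoding
# of the image with no per-token supervision.
import torch
import torch.nn as nn

class LatentVisualReasoner(nn.Module):
    def __init__(self, dim: int = 768, num_latents: int = 8, num_heads: int = 8):
        super().__init__()
        # Latent "visual reasoning tokens", learned end-to-end from the task loss.
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, image_tokens: torch.Tensor) -> torch.Tensor:
        # image_tokens: (batch, num_patches, dim) from the vision encoder.
        b = image_tokens.size(0)
        queries = self.latents.unsqueeze(0).expand(b, -1, -1)
        # Each latent token attends globally over all image patches.
        summary, _ = self.attn(queries, image_tokens, image_tokens)
        # The outputs would be appended to the LMM's input sequence as a
        # task-adaptive re-encoding of the image.
        return self.proj(summary)

if __name__ == "__main__":
    reasoner = LatentVisualReasoner()
    patches = torch.randn(2, 196, 768)        # e.g. ViT-B/16 patch embeddings
    latent_tokens = reasoner(patches)         # (2, 8, 768)
    print(latent_tokens.shape)
```

Under this reading, the tokens are optimized only through the downstream objective, which is consistent with the abstract's claim that no helper images, depth maps, or crops are needed as intermediate supervision.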