How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning

May 26, 2026
Authors: Qian Yang, Ankur Sikarwar, Huy Le, Le Zhang, Zhuan Shi, Perouz Taslakian, Aishwarya Agrawal
cs.AI

Abstract

Cross-view spatial reasoning remains a weak spot for vision-language models (VLMs): they often reason in language and lose the fine-grained geometry needed for the task. Thinking with images aims to address this by generating an intermediate thinking image, but recent work shows that models often ignore the visual evidence in these traces. We therefore ask how to make visual thinking matter, and what kind of visual thinking works best. We study these questions in unified multimodal models (UMMs), which natively support interleaved image-text generation. For the first question, we propose View Dropout (VDrop), a training-time intervention that hides parts of one input view from the answer span while keeping them visible to the thinking-image tokens. This encourages the model to use the thinking image when answering, instead of relying only on the input views. Once the thinking image is used for answer prediction, we study which type of visual thinking is most effective. We frame this as a learnability-informativeness tradeoff and compare three thinking-image variants: top-down, panoramic, and point-matching renderings. Trained on synthetic scenes and evaluated on five real-world out-of-domain benchmarks, panoramic visual thinking with VDrop is the only configuration that is both informative and learnable, and it achieves the best out-of-domain generalization.
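
The abstract describes VDrop only at a high level, so the following is a minimal sketch of one way such an intervention could be expressed as an attention mask, not the authors' implementation. Everything here is an assumption made for illustration: the function name `vdrop_attention_mask`, the segment labels, the causal base mask, and the choice to hide a random subset of tokens rather than, say, a contiguous image region.

```python
import numpy as np

def vdrop_attention_mask(segments, drop_view, drop_fraction, rng):
    """Build a boolean attention mask (True = query row may attend to key column).

    Hypothetical helper, not taken from the paper.
    segments: list of (name, length) pairs in sequence order; must include "answer".
    drop_view: name of the input view that is partially hidden from the answer span.
    drop_fraction: fraction of that view's tokens hidden from answer tokens.
    """
    # One label per token position, e.g. ["view_a", "view_a", ..., "answer"].
    labels = np.array([name for name, n in segments for _ in range(n)])
    total = len(labels)

    # Assumed base mask: plain causal attention (a simplification).
    mask = np.tril(np.ones((total, total), dtype=bool))

    # Pick which tokens of the chosen view are hidden (assumption: random subset).
    view_idx = np.flatnonzero(labels == drop_view)
    n_drop = int(round(drop_fraction * len(view_idx)))
    dropped = rng.choice(view_idx, size=n_drop, replace=False)

    # Hide the dropped tokens from answer queries only. Thinking-image
    # queries keep their access, so the dropped content can still reach
    # the answer, but only by way of the thinking image.
    answer_idx = np.flatnonzero(labels == "answer")
    mask[np.ix_(answer_idx, dropped)] = False
    return mask, dropped

# Toy sequence: two input views, a thinking image, then the answer span.
segments = [("view_a", 4), ("view_b", 4), ("think", 3), ("answer", 2)]
mask, dropped = vdrop_attention_mask(
    segments, drop_view="view_b", drop_fraction=0.5, rng=np.random.default_rng(0)
)
```

In this toy setup the answer tokens still see all of `view_a`, the undropped half of `view_b`, and the thinking-image tokens. Applying the mask only during training matches the abstract's description of VDrop as a training-time intervention.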