Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

Abstract

Vision-language models (VLMs) excel at many tasks but still struggle with spatial reasoning when critical information is not directly observable. Many such problems require imaginative perception: inferring what would be seen from an unseen viewpoint, tracing paths through occluded spaces, or integrating partial observations into a coherent spatial representation. We introduce Imaginative Perception Tokens (IPT), intermediate perceptual representations that externalize what a VLM would perceive under alternative spatial configurations while remaining consistent with the observed input. To study this capability, we formulate three tasks, Perspective Taking (PET), Path Tracing (PT), and Multiview Counting (MVC), and construct datasets of approximately 20K examples with ground-truth imaginations, answers, and evaluation benchmarks. Using the unified VLM BAGEL as the backbone, we find that IPT supervision consistently improves spatial reasoning and often outperforms textual chain-of-thought training, even when no images are generated at inference time. On MVC, IPT improves accuracy by 3.4%; on PT, it is competitive with strong closed-source models. We further find that combining IPT with label-only supervision yields additional gains, whereas textual chain-of-thought training can substantially degrade performance, suggesting a modality mismatch when spatial computation is forced through language. Overall, IPT provides a principled supervision signal for reasoning about unobserved spatial structure, improving generalization while producing interpretable intermediate representations.