Spatial Mental Modeling from Limited Views

June 26, 2025
作者: Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Manling Li, Jiajun Wu, Li Fei-Fei
cs.AI

Abstract

Can Vision Language Models (VLMs) imagine the full scene from just a few views, like humans do? Humans form spatial mental models, internal representations of unseen space, to reason about layout, perspective, and motion. Our new MindCube benchmark with 21,154 questions across 3,268 images exposes this critical gap, where existing VLMs exhibit near-random performance. Using MindCube, we systematically evaluate how well VLMs build robust spatial mental models through representing positions (cognitive mapping), orientations (perspective-taking), and dynamics (mental simulation for "what-if" movements). We then explore three approaches to help VLMs approximate spatial mental models, including unseen intermediate views, natural language reasoning chains, and cognitive maps. The most significant improvement comes from a synergistic approach, "map-then-reason", that jointly trains the model to first generate a cognitive map and then reason upon it. By training models to reason over these internal maps, we boosted accuracy from 37.8% to 60.8% (+23.0%). Adding reinforcement learning pushed performance even further to 70.7% (+32.9%). Our key insight is that such scaffolding of spatial mental models, actively constructing and utilizing internal structured spatial representations with flexible reasoning processes, significantly improves understanding of unobservable space.
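To make the "map-then-reason" idea concrete, below is a minimal, hypothetical sketch of the two-stage structure at inference time: the model first emits an explicit cognitive map of the scene, then answers the spatial question conditioned on that map. Note that the paper jointly trains the model for this behavior; this sketch only mimics the two-stage prompting structure. The function vlm_generate, the prompt wording, and the grid-map format are illustrative assumptions, not the paper's actual implementation.

```python
from typing import List


def vlm_generate(images: List[str], prompt: str) -> str:
    """Placeholder for any VLM inference call (API or local model);
    not part of the MindCube codebase."""
    raise NotImplementedError("plug in your VLM backend here")


def map_then_reason(images: List[str], question: str) -> str:
    # Stage 1: ask the model to construct an explicit cognitive map
    # (object positions and orientations on a coarse top-down grid).
    map_prompt = (
        "From these views, describe the scene as a top-down cognitive map: "
        "list each object with approximate grid coordinates and orientation."
    )
    cognitive_map = vlm_generate(images, map_prompt)

    # Stage 2: answer by reasoning over the generated map,
    # rather than responding directly from the raw views.
    reason_prompt = (
        f"Cognitive map of the scene:\n{cognitive_map}\n\n"
        f"Using this map, reason step by step and answer: {question}"
    )
    return vlm_generate(images, reason_prompt)
```

The design point this sketch illustrates is the scaffolding itself: forcing an intermediate structured spatial representation before answering, which the abstract reports is far more effective than direct question answering from the input views.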