Spatial Mental Modeling from Limited Views
June 26, 2025
作者: Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Manling Li, Jiajun Wu, Li Fei-Fei
cs.AI
Abstract
Can Vision Language Models (VLMs) imagine the full scene from just a few
views, like humans do? Humans form spatial mental models, internal
representations of unseen space, to reason about layout, perspective, and
motion. Our new MindCube benchmark with 21,154 questions across 3,268 images
exposes this critical gap, where existing VLMs exhibit near-random performance.
Using MindCube, we systematically evaluate how well VLMs build robust spatial
mental models through representing positions (cognitive mapping), orientations
(perspective-taking), and dynamics (mental simulation for "what-if" movements).
We then explore three approaches to help VLMs approximate spatial mental
models: supplying unseen intermediate views, eliciting natural language
reasoning chains, and constructing cognitive maps. The most significant
improvement comes from a synergistic approach, "map-then-reason", that jointly
trains the model to first generate a
cognitive map and then reason upon it. By training models to reason over these
internal maps, we boosted accuracy from 37.8% to 60.8% (+23.0%). Adding
reinforcement learning pushed performance even further to 70.7% (+32.9%). Our
key insight is that this scaffolding of spatial mental models, actively
constructing and using internal structured spatial representations together
with flexible reasoning, substantially improves understanding of unobservable
space.
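The "map-then-reason" pattern described above can be illustrated as a two-stage prompting pipeline. This is a minimal, hypothetical sketch, not the paper's actual implementation: the function names (`build_map_prompt`, `map_then_reason`) and the JSON map format are illustrative assumptions, and the model call is stubbed out so the pipeline runs without a real VLM.

```python
import json

def build_map_prompt(question: str) -> str:
    """Stage 1 (assumed format): ask the model to externalize a
    cognitive map of the scene as JSON before answering."""
    return (
        "Given the input views, output a cognitive map as JSON mapping "
        "object names to (x, y) grid positions.\n"
        f"Question for context: {question}"
    )

def build_reason_prompt(cog_map: dict, question: str) -> str:
    """Stage 2: answer the spatial question conditioned on the map."""
    return (
        f"Cognitive map: {json.dumps(cog_map, sort_keys=True)}\n"
        f"Using the map above, reason step by step and answer: {question}"
    )

def map_then_reason(ask_vlm, question: str) -> str:
    """Two-stage pipeline: generate a map, then reason over it."""
    cog_map = json.loads(ask_vlm(build_map_prompt(question)))
    return ask_vlm(build_reason_prompt(cog_map, question))

# Stub model so the sketch is runnable; a real system would call a VLM here.
def stub_vlm(prompt: str) -> str:
    if prompt.startswith("Given the input views"):
        return '{"chair": [0, 1], "table": [2, 1]}'
    return "The chair is to the left of the table."

answer = map_then_reason(stub_vlm, "Where is the chair relative to the table?")
print(answer)
```

The key design point mirrored here is that the map is generated as an explicit intermediate artifact and then fed back into the reasoning step, rather than being kept implicit in the model's activations.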