Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views
October 21, 2025
Authors: Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xufang Luo, Mingze Sun, Zihao Pan, Yan Feng, Peng Pei, Xunliang Cai, Ruqi Huang
cs.AI
Abstract
Though recent advances in vision-language models (VLMs) have achieved
remarkable progress across a wide range of multimodal tasks, understanding 3D
spatial relationships from limited views remains a significant challenge.
Previous reasoning methods typically rely on pure text (e.g., topological
cognitive maps) or on 2D visual cues. However, their limited representational
capacity hinders performance in specific tasks that require 3D spatial
imagination. To address this limitation, we propose 3DThinker, a framework that
effectively exploits the rich geometric information embedded within images
while reasoning, like humans do. Our framework is the first to enable 3D
mentaling during reasoning without any 3D prior input, and it does not rely on
explicitly labeled 3D data for training. Specifically, our training consists of
two stages. First, we perform supervised training to align the 3D latent
generated by the VLM during reasoning with that of a 3D foundation model (e.g.,
VGGT). Then, we optimize the entire reasoning trajectory solely based on
outcome signals, thereby refining the underlying 3D mentaling. Extensive
experiments across multiple benchmarks show that 3DThinker consistently
outperforms strong baselines and offers a new perspective toward unifying 3D
representations into multimodal reasoning. Our code will be available at
https://github.com/zhangquanchen/3DThinker.
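
The two training signals described in the abstract can be illustrated with a minimal sketch. The abstract does not state the alignment loss or the form of the outcome signal, so everything below is an assumption made for illustration: cosine distance stands in for the stage-one alignment objective, exact-match correctness stands in for the stage-two outcome signal, and the function names `cosine_alignment_loss` and `outcome_reward` are hypothetical rather than taken from the 3DThinker code.

```python
import math


def cosine_alignment_loss(vlm_latents, target_feats):
    """Mean (1 - cosine similarity) over paired feature vectors.

    Assumption: stands in for the stage-one objective that pulls the VLM's
    3D latents toward features from a frozen 3D foundation model. The actual
    loss is not given in the abstract.
    """
    if not vlm_latents or len(vlm_latents) != len(target_feats):
        raise ValueError("need equally sized, non-empty lists of vectors")
    total = 0.0
    for u, v in zip(vlm_latents, target_feats):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        # Small epsilon guards against division by zero for all-zero vectors.
        total += 1.0 - dot / (norm_u * norm_v + 1e-8)
    return total / len(vlm_latents)


def outcome_reward(predicted_answer, reference_answer):
    """Return 1.0 if the final answer matches the reference, else 0.0.

    Assumption: a simple exact-match rule (ignoring case and surrounding
    whitespace) stands in for the stage-two outcome signal.
    """
    same = predicted_answer.strip().lower() == reference_answer.strip().lower()
    return 1.0 if same else 0.0


if __name__ == "__main__":
    # Toy vectors only; real latents would come from the models themselves.
    print(cosine_alignment_loss([[1.0, 0.0]], [[0.0, 1.0]]))
    print(outcome_reward(" B ", "b"))
```

In this sketch, stage one would minimize the alignment term on the latent tokens, while stage two would score only the final answer and leave the intermediate reasoning unconstrained, which mirrors the abstract's statement that the trajectory is optimized solely from outcome signals.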