Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views

Abstract

Although recent advances in vision-language models (VLMs) have achieved remarkable progress across a wide range of multimodal tasks, understanding 3D spatial relationships from limited views remains a significant challenge. Previous reasoning methods typically rely on pure text (e.g., topological cognitive maps) or on 2D visual cues. However, their limited representational capacity hinders performance on tasks that require 3D spatial imagination. To address this limitation, we propose 3DThinker, a framework that effectively exploits the rich geometric information embedded within images while reasoning, as humans do. Our framework is the first to enable 3D mentaling (3D mental simulation) during reasoning without any 3D prior input, and it does not rely on explicitly labeled 3D data for training. Specifically, our training consists of two stages. First, we perform supervised training to align the 3D latent generated by the VLM while reasoning with that of a 3D foundation model (e.g., VGGT). Then, we optimize the entire reasoning trajectory solely based on outcome signals, thereby refining the underlying 3D mentaling. Extensive experiments across multiple benchmarks show that 3DThinker consistently outperforms strong baselines and offers a new perspective toward unifying 3D representations into multimodal reasoning. Our code will be available at https://github.com/zhangquanchen/3DThinker.
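
The two training stages described above can be illustrated with a minimal sketch. The abstract does not state the loss or reward that 3DThinker actually uses, so everything here is an assumption made for illustration: the cosine-distance alignment loss, the exact-match outcome reward, and the hypothetical names `cosine_alignment_loss` and `outcome_reward`. Vectors are plain Python lists standing in for the VLM's 3D latent and the 3D foundation model's feature.

```python
import math


def cosine_alignment_loss(vlm_latent, target_feature):
    """Stage 1 (assumed form): 1 - cosine similarity between the VLM's
    3D latent and the corresponding 3D foundation model feature."""
    dot = sum(a * b for a, b in zip(vlm_latent, target_feature))
    norm_a = math.sqrt(sum(a * a for a in vlm_latent))
    norm_b = math.sqrt(sum(b * b for b in target_feature))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as having no similarity
    return 1.0 - dot / (norm_a * norm_b)


def outcome_reward(predicted_answer, reference_answer):
    """Stage 2 (assumed form): a reward that depends only on the final
    answer, not on the intermediate reasoning or the 3D latent."""
    same = predicted_answer.strip().lower() == reference_answer.strip().lower()
    return 1.0 if same else 0.0


# Hypothetical toy values, not data from the paper.
latent = [0.2, 0.5, -0.1]
target = [0.4, 1.0, -0.2]  # same direction as `latent`, so the loss is near zero
print(cosine_alignment_loss(latent, target))
print(outcome_reward(" Left ", "left"))
```

In this sketch the first stage supervises the latent directly, while the second stage only sees whether the final answer is correct, which mirrors the abstract's distinction between aligned supervision and outcome-only optimization.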
