
Reasoning in Space via Grounding in the World

October 15, 2025
Authors: Yiming Chen, Zekun Qi, Wenyao Zhang, Xin Jin, Li Zhang, Peidong Liu
cs.AI

Abstract

In this paper, we claim that 3D visual grounding is the cornerstone of spatial reasoning and introduce the Grounded-Spatial Reasoner (GS-Reasoner) to explore effective spatial representations that bridge the gap between the two. Existing 3D LLMs suffer from the absence of a unified 3D representation capable of jointly capturing semantic and geometric information. This deficiency manifests either as poor grounding performance or as excessive reliance on external modules, ultimately hindering the seamless integration of grounding and spatial reasoning. To address this, we propose a simple yet effective dual-path pooling mechanism that tightly aligns geometric features with both semantic and positional cues, constructing a unified image-patch-based 3D representation that encapsulates all essential information without increasing the number of input tokens. Leveraging this holistic representation, GS-Reasoner is the first 3D LLM that achieves autoregressive grounding entirely without external modules while delivering performance comparable to state-of-the-art models, establishing a unified and self-contained framework for 3D spatial reasoning. To further bridge grounding and spatial reasoning, we introduce the Grounded Chain-of-Thought (GCoT) dataset. This dataset is curated to include both 3D bounding box annotations for objects referenced in reasoning questions and step-by-step reasoning paths that integrate grounding as a core component of the problem-solving process. Extensive experiments demonstrate that GS-Reasoner achieves impressive results on 3D visual grounding, which in turn significantly enhances its spatial reasoning capabilities, leading to state-of-the-art performance.