Reasoning in Space via Grounding in the World
October 15, 2025
Authors: Yiming Chen, Zekun Qi, Wenyao Zhang, Xin Jin, Li Zhang, Peidong Liu
cs.AI
Abstract
In this paper, we argue that 3D visual grounding is the cornerstone of
spatial reasoning and introduce the Grounded-Spatial Reasoner (GS-Reasoner) to
explore spatial representations that effectively bridge the two.
Existing 3D LLMs lack a unified 3D representation that jointly captures
semantic and geometric information. This deficiency manifests either as poor
grounding performance or as excessive reliance on external modules, and it
ultimately hinders the seamless integration of grounding and spatial
reasoning. To address this, we propose a simple yet effective dual-path
pooling mechanism that tightly aligns geometric features with both semantic
and positional cues, constructing a unified, image-patch-based 3D
representation that encapsulates all essential information without increasing
the number of input tokens. Leveraging this holistic representation,
GS-Reasoner is the first 3D LLM to achieve autoregressive grounding entirely
without external modules while delivering performance comparable to
state-of-the-art models, establishing a unified and self-contained framework
for 3D spatial reasoning. To further bridge grounding and spatial reasoning, we
introduce the Grounded Chain-of-Thought (GCoT) dataset. This dataset is
curated to include both 3D bounding box annotations for the objects
referenced in reasoning questions and step-by-step reasoning paths that
integrate grounding as a core component of the problem-solving process.
Extensive experiments demonstrate that GS-Reasoner achieves strong results
on 3D visual grounding, which in turn significantly enhances its spatial
reasoning capabilities, leading to state-of-the-art performance.
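
To make the idea of reasoning steps that build on grounding more concrete, the following is a minimal Python sketch of what a grounded chain-of-thought record could look like. It is an illustration only and not the actual GCoT schema: the class names `Box3D` and `GroundedSample`, the function `build_distance_sample`, the axis-aligned box format, and the example coordinates are all assumptions introduced here. The sketch first grounds the two objects mentioned in a question as 3D boxes, then derives the answer from those boxes.

```python
from dataclasses import dataclass
import math


@dataclass
class Box3D:
    """Hypothetical axis-aligned 3D box: center and size in meters."""
    label: str
    center: tuple[float, float, float]
    size: tuple[float, float, float]


@dataclass
class GroundedSample:
    """Hypothetical record: a question, the boxes of the objects it
    mentions, the reasoning steps that use those boxes, and the answer."""
    question: str
    boxes: list[Box3D]
    steps: list[str]
    answer: str


def center_distance(a: Box3D, b: Box3D) -> float:
    """Euclidean distance between the two box centers."""
    return math.dist(a.center, b.center)


def build_distance_sample(a: Box3D, b: Box3D) -> GroundedSample:
    """Build a sample whose reasoning path grounds both objects first
    and only then computes the distance from the grounded boxes."""
    d = center_distance(a, b)
    steps = [
        f"Ground '{a.label}': center={a.center}, size={a.size}",
        f"Ground '{b.label}': center={b.center}, size={b.size}",
        f"Distance between the two centers = {d:.2f} m",
    ]
    return GroundedSample(
        question=f"How far is the {a.label} from the {b.label}?",
        boxes=[a, b],
        steps=steps,
        answer=f"{d:.2f} m",
    )


# Made-up coordinates, chosen only so the arithmetic is easy to check.
chair = Box3D("chair", (0.0, 0.0, 0.5), (0.5, 0.5, 1.0))
table = Box3D("table", (3.0, 4.0, 0.5), (1.2, 0.8, 1.0))
sample = build_distance_sample(chair, table)
print(sample.answer)  # prints "5.00 m"
```

In this sketch the answer depends only on the grounded boxes, so an error in grounding propagates directly into the reasoning result, which is one way to read the abstract's claim that better grounding improves spatial reasoning.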