
Think3D: Thinking with Space for Spatial Reasoning

January 19, 2026
Authors: Zaibin Zhang, Yuhan Wu, Lianjie Jia, Yifan Wang, Zhongbo Zhang, Yijiang Li, Binghao Ran, Fuxi Zhang, Zhuohan Sun, Zhenfei Yin, Lijun Wang, Huchuan Lu
cs.AI

Abstract

Understanding and reasoning about the physical world requires spatial intelligence: the ability to interpret geometry, perspective, and spatial relations beyond 2D perception. While recent vision large models (VLMs) excel at visual understanding, they remain fundamentally 2D perceivers and struggle with genuine 3D reasoning. We introduce Think3D, a framework that enables VLM agents to think with 3D space. By leveraging 3D reconstruction models that recover point clouds and camera poses from images or videos, Think3D allows the agent to actively manipulate space through camera-based operations and ego/global-view switching, transforming spatial reasoning into an interactive 3D chain-of-thought process. Without additional training, Think3D significantly improves the spatial reasoning performance of advanced models such as GPT-4.1 and Gemini 2.5 Pro, yielding average gains of +7.8% on BLINK Multi-view and MindCube, and +4.7% on VSI-Bench. We further show that smaller models, which struggle with spatial exploration, benefit significantly from a reinforcement learning policy that enables the model to select informative viewpoints and operations. With RL, the benefit from tool usage increases from +0.7% to +6.8%. Our findings demonstrate that training-free, tool-augmented spatial exploration is a viable path toward more flexible and human-like 3D reasoning in multimodal agents, establishing a new dimension of multimodal intelligence. Code and weights are released at https://github.com/zhangzaibin/spagent.
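The abstract describes an agentic loop: reconstruct the scene once from the 2D inputs, then let the VLM iteratively choose camera operations (including ego/global-view switches) and re-render until it can answer. The sketch below illustrates that interactive 3D chain-of-thought loop under stated assumptions; it is not the released spagent API. The names `reconstruct_scene`, `render_view`, `CameraOp`, and the `vlm.chat` interface are all hypothetical stand-ins.

```python
# Minimal sketch of an interactive 3D chain-of-thought loop in the
# spirit of Think3D. Everything here is a hypothetical stand-in, not
# the released spagent API: `reconstruct_scene` would wrap a 3D
# reconstruction model that recovers a point cloud and camera poses,
# `render_view` would rasterize the point cloud from a given camera,
# and `vlm` is any chat-style vision-language client.

from dataclasses import dataclass, field


@dataclass
class CameraOp:
    """One spatial 'thought': move/rotate the camera or switch viewpoint."""
    kind: str                 # e.g. "rotate", "translate", "ego_view", "global_view"
    params: dict = field(default_factory=dict)


def think3d_answer(images, question, vlm, reconstruct_scene, render_view,
                   max_steps=8):
    """Answer `question` by letting the VLM actively explore a 3D scene."""
    # 1. Lift the 2D inputs into an explicit 3D scene (points + poses).
    scene = reconstruct_scene(images)
    view = scene.initial_view()

    for _ in range(max_steps):
        # 2. Re-render the scene from the agent's current virtual camera.
        rendering = render_view(scene, view)
        # 3. The VLM either commits to an answer or requests another
        #    camera operation, turning reasoning into view selection.
        reply = vlm.chat(images=[rendering], prompt=question)
        if reply.is_final_answer:
            return reply.text
        view = view.apply(CameraOp(reply.op_kind, reply.op_params))

    # Exploration budget exhausted: answer from the last rendering.
    return vlm.chat(images=[render_view(scene, view)], prompt=question,
                    force_answer=True).text
```

In this framing, the reported RL policy for smaller models would amount to training the step that selects `CameraOp`s so the agent picks informative viewpoints rather than exploring at random.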