Think, Act, Build: An Agentic Framework with Vision Language Models for Zero-Shot 3D Visual Grounding
April 1, 2026
Authors: Haibo Wang, Zihao Lin, Zhiyang Xu, Lifu Huang
cs.AI
Abstract
3D Visual Grounding (3D-VG) aims to localize objects in 3D scenes from natural language descriptions. While recent work leveraging Vision-Language Models (VLMs) has explored zero-shot capabilities, these methods typically follow a static workflow over preprocessed 3D point clouds, effectively reducing grounding to proposal matching. To remove this dependence, our core idea is to decouple the task: a 2D VLM resolves complex spatial semantics, while deterministic multi-view geometry instantiates the 3D structure. Driven by this insight, we propose "Think, Act, Build" (TAB), a dynamic agentic framework that reformulates 3D-VG as a generative 2D-to-3D reconstruction paradigm operating directly on raw RGB-D streams. Specifically, guided by a specialized 3D-VG skill, our VLM agent dynamically invokes visual tools to track and reconstruct the target across 2D frames. Crucially, to overcome the multi-view coverage deficit caused by strict VLM semantic tracking, we introduce Semantic-Anchored Geometric Expansion, a mechanism that first anchors the target in a reference video clip and then uses multi-view geometry to propagate its spatial location to unobserved frames. This lets the agent "Build" the target's 3D representation by aggregating multi-view features via camera parameters, directly mapping 2D visual cues to 3D coordinates. Furthermore, to ensure rigorous assessment, we identify flaws such as referential ambiguity and category errors in existing benchmarks and manually correct the faulty queries. Extensive experiments on ScanRefer and Nr3D show that our framework, relying entirely on open-source models, significantly outperforms previous zero-shot methods and even surpasses fully supervised baselines.
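The geometric core the abstract describes — lifting semantically tracked 2D pixels into 3D via depth and camera parameters, then projecting the anchored target into unobserved frames — reduces to standard pinhole back-projection and projection. Below is a minimal NumPy sketch of these two operations; the function names, the binary target mask, and the camera-to-world pose convention are illustrative assumptions, not the paper's actual API:

```python
import numpy as np

def backproject(depth, mask, K, cam_to_world):
    """Lift masked pixels of a depth frame to world-space 3D points.

    depth: (H, W) depth map in meters; mask: (H, W) boolean target mask
    (e.g. from a 2D tracker); K: 3x3 pinhole intrinsics; cam_to_world:
    4x4 camera-to-world pose. Returns an (N, 3) array of world points.
    """
    v, u = np.nonzero(mask)
    z = depth[v, u]
    valid = z > 0                      # skip pixels with no depth reading
    u, v, z = u[valid], v[valid], z[valid]
    x = (u - K[0, 2]) * z / K[0, 0]    # pinhole model: X = (u - cx) * Z / fx
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)])   # 4xN homogeneous
    return (cam_to_world @ pts_cam)[:3].T

def project(points_world, K, cam_to_world):
    """Project world points into another frame's pixel coordinates,
    e.g. to propagate an anchored target to an unobserved frame."""
    world_to_cam = np.linalg.inv(cam_to_world)
    pts = np.hstack([points_world, np.ones((len(points_world), 1))])
    cam = (world_to_cam @ pts.T)[:3]
    uv = (K @ cam) / cam[2]            # perspective divide by depth
    return uv[:2].T                    # (N, 2) pixel coordinates
```

Aggregating `backproject` outputs across all frames where the target is anchored (or geometrically propagated) yields the fused 3D point set from which a final bounding box can be derived.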