Think, Act, Build: An Agentic Framework with Vision Language Models for Zero-Shot 3D Visual Grounding
April 1, 2026
Authors: Haibo Wang, Zihao Lin, Zhiyang Xu, Lifu Huang
cs.AI
Abstract
3D Visual Grounding (3D-VG) aims to localize objects in 3D scenes from natural language descriptions. While recent approaches leveraging Vision-Language Models (VLMs) have explored zero-shot settings, they typically rely on a static workflow over preprocessed 3D point clouds, essentially reducing grounding to proposal matching. To bypass this reliance, our core idea is to decouple the task: 2D VLMs resolve complex spatial semantics, while deterministic multi-view geometry instantiates the 3D structure. Driven by this insight, we propose "Think, Act, Build (TAB)", a dynamic agentic framework that reformulates 3D-VG as a generative 2D-to-3D reconstruction paradigm operating directly on raw RGB-D streams. Specifically, guided by a specialized 3D-VG skill, our VLM agent dynamically invokes visual tools to track and reconstruct the target across 2D frames. Crucially, to overcome the multi-view coverage deficit caused by strict VLM semantic tracking, we introduce Semantic-Anchored Geometric Expansion, a mechanism that first anchors the target in a reference video clip and then uses multi-view geometry to propagate its spatial location to unobserved frames. This enables the agent to "Build" the target's 3D representation by aggregating multi-view features via camera parameters, directly mapping 2D visual cues to 3D coordinates. Furthermore, to ensure rigorous assessment, we identify flaws in existing benchmarks, such as reference ambiguity and category errors, and manually correct the faulty queries. Extensive experiments on ScanRefer and Nr3D demonstrate that our framework, built entirely on open-source models, significantly outperforms previous zero-shot methods and even surpasses fully supervised baselines.
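The core geometric step the abstract describes, mapping 2D visual cues to 3D coordinates via camera parameters, is standard pinhole back-projection. The sketch below is illustrative and is not the paper's implementation: the function name `backproject_to_world` and the array shapes are assumptions; it lifts pixel observations with metric depth into world coordinates using camera intrinsics `K` and a camera-to-world extrinsic `T_cam2world`.

```python
import numpy as np

def backproject_to_world(uv, depth, K, T_cam2world):
    """Lift 2D pixels with depth into 3D world coordinates (illustrative sketch).

    uv: (N, 2) pixel coordinates; depth: (N,) metric depths along the camera z-axis;
    K: (3, 3) camera intrinsics; T_cam2world: (4, 4) camera-to-world extrinsics.
    Returns (N, 3) world-frame points.
    """
    n = uv.shape[0]
    ones = np.ones((n, 1))
    pix = np.hstack([uv, ones])                  # homogeneous pixel coords (N, 3)
    rays = (np.linalg.inv(K) @ pix.T).T          # camera-frame rays with z = 1
    pts_cam = rays * depth[:, None]              # scale rays by metric depth
    pts_h = np.hstack([pts_cam, ones])           # homogeneous 3D points (N, 4)
    return (T_cam2world @ pts_h.T).T[:, :3]      # transform to world frame

# Example: a pixel at the principal point of a 500-focal-length camera,
# observed at 2 m depth, back-projects to (0, 0, 2) in camera coordinates.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0,   0.0,   1.0]])
T_identity = np.eye(4)
pts = backproject_to_world(np.array([[320.0, 240.0]]), np.array([2.0]),
                           K, T_identity)
```

Aggregating such back-projected points across multiple semantically tracked frames (each with its own `T_cam2world`) yields the multi-view 3D representation of the target that the framework "builds".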