Think, Act, Build: An Agentic Framework with Vision Language Models for Zero-Shot 3D Visual Grounding

Abstract

3D Visual Grounding (3D-VG) aims to localize objects in 3D scenes from natural language descriptions. Recent work has used Vision-Language Models (VLMs) to explore zero-shot 3D-VG, but these methods typically follow a static workflow that relies on preprocessed 3D point clouds, which effectively reduces grounding to proposal matching. To avoid this reliance, our core motivation is to decouple the task: 2D VLMs resolve complex spatial semantics, while deterministic multi-view geometry instantiates the 3D structure. Driven by this insight, we propose "Think, Act, Build (TAB)", a dynamic agentic framework that reformulates 3D-VG as a generative 2D-to-3D reconstruction paradigm operating directly on raw RGB-D streams. Specifically, guided by a specialized 3D-VG skill, our VLM agent dynamically invokes visual tools to track and reconstruct the target across 2D frames. Crucially, to overcome the multi-view coverage deficit caused by strict VLM semantic tracking, we introduce Semantic-Anchored Geometric Expansion, a mechanism that first anchors the target in a reference video clip and then uses multi-view geometry to propagate its spatial location to unobserved frames. The agent can then "Build" the target's 3D representation by aggregating these multi-view features via camera parameters, directly mapping 2D visual cues to 3D coordinates. Furthermore, to ensure rigorous evaluation, we identify flaws in existing benchmarks, such as reference ambiguity and category errors, and manually refine the incorrect queries. Extensive experiments on ScanRefer and Nr3D demonstrate that our framework, relying entirely on open-source models, significantly outperforms previous zero-shot methods and even surpasses fully supervised baselines.
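To make the "Build" step more concrete, the sketch below shows one common way to map 2D target pixels to 3D coordinates using camera parameters. It is an illustration only, not the authors' implementation. It assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, a 4x4 camera-to-world pose per frame, metric depth for each masked pixel, and an axis-aligned box as the final 3D representation. The function names `backproject_pixel`, `camera_to_world`, and `build_box` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]
# One frame = (masked target pixels as (u, v, depth), 4x4 camera-to-world pose).
Frame = Tuple[List[Tuple[float, float, float]], Sequence[Sequence[float]]]


def backproject_pixel(u: float, v: float, depth: float,
                      fx: float, fy: float, cx: float, cy: float) -> Point3:
    """Pinhole back-projection of pixel (u, v) with metric depth into camera space."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(p: Point3, pose: Sequence[Sequence[float]]) -> Point3:
    """Apply a 4x4 camera-to-world pose (assumed convention) to a camera-space point."""
    x, y, z = p
    wx, wy, wz = (
        pose[r][0] * x + pose[r][1] * y + pose[r][2] * z + pose[r][3]
        for r in range(3)
    )
    return (wx, wy, wz)


def build_box(frames: List[Frame], fx: float, fy: float,
              cx: float, cy: float) -> Tuple[Point3, Point3]:
    """Aggregate target pixels from several views into one axis-aligned 3D box."""
    points: List[Point3] = []
    for pixels, pose in frames:
        for u, v, depth in pixels:
            if depth <= 0:  # skip invalid depth readings
                continue
            cam_point = backproject_pixel(u, v, depth, fx, fy, cx, cy)
            points.append(camera_to_world(cam_point, pose))
    if not points:
        raise ValueError("no valid target pixels to build a box from")
    mins = (min(p[0] for p in points), min(p[1] for p in points), min(p[2] for p in points))
    maxs = (max(p[0] for p in points), max(p[1] for p in points), max(p[2] for p in points))
    return mins, maxs
```

In this sketch, each additional view contributes more world-space points, so the box covers more of the target. This is one reason why propagating the target location to frames beyond the reference clip could improve the final 3D estimate.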
