VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection
March 1, 2026
Authors: Yang Cao, Feize Wu, Dave Zhenyu Chen, Yingji Zhong, Lanqing Hong, Dan Xu
cs.AI
Abstract
Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain (i.e., precisely calibrated multi-view camera poses) to fuse multi-view information into a global scene representation, limiting deployment in real-world scenes. We target a more practical setting: Sensor-Geometry-Free (SG-Free) multi-view indoor 3D object detection, where no sensor-provided geometric inputs (multi-view poses or depth) are available. The recent Visual Geometry Grounded Transformer (VGGT) shows that strong 3D cues can be inferred directly from images. Building on this insight, we present VGGT-Det, the first framework tailored for SG-Free multi-view indoor 3D object detection. Rather than merely consuming VGGT predictions, our method integrates the VGGT encoder into a transformer-based pipeline. To effectively leverage both the semantic and geometric priors inside VGGT, we introduce two key components: (i) Attention-Guided Query Generation (AG), which exploits VGGT attention maps as semantic priors to initialize object queries, improving localization by focusing on object regions while preserving global spatial structure; and (ii) Query-Driven Feature Aggregation (QD), in which a learnable See-Query interacts with the object queries to 'see' what they need and then dynamically aggregates multi-level geometric features across VGGT layers, progressively lifting 2D features into 3D. Experiments show that VGGT-Det significantly surpasses the best-performing SG-Free method by 4.4 and 8.6 mAP@0.25 on ScanNet and ARKitScenes, respectively. Ablation studies show that our AG and QD components effectively leverage the semantic and geometric priors learned inside VGGT.
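The abstract describes AG and QD only at a high level, so the following PyTorch sketch is an illustrative reading of the two mechanisms, not the authors' implementation. All tensor shapes, module and parameter names (attention_guided_queries, QueryDrivenAggregation, see_query), and the top-k selection and softmax level-weighting choices are assumptions for the sake of a runnable example.

```python
# Illustrative sketch of the AG and QD ideas (not the authors' code).
import torch
import torch.nn as nn


def attention_guided_queries(attn_maps, feats, num_queries):
    """Attention-Guided Query Generation (AG), roughly sketched.

    attn_maps: (B, N) per-token attention scores pooled from VGGT layers
    feats:     (B, N, C) VGGT encoder tokens
    Returns (B, num_queries, C): object queries initialized from the
    highest-attention tokens, so queries start on likely object regions.
    """
    top_idx = attn_maps.topk(num_queries, dim=1).indices           # (B, Q)
    idx = top_idx.unsqueeze(-1).expand(-1, -1, feats.size(-1))     # (B, Q, C)
    return feats.gather(1, idx)                                    # (B, Q, C)


class QueryDrivenAggregation(nn.Module):
    """Query-Driven Feature Aggregation (QD), roughly sketched.

    A learnable See-Query cross-attends to the object queries to 'see'
    what they need, then predicts per-layer weights to mix geometric
    features drawn from multiple VGGT layers.
    """

    def __init__(self, dim, num_levels):
        super().__init__()
        self.see_query = nn.Parameter(torch.randn(1, 1, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8,
                                                batch_first=True)
        self.level_logits = nn.Linear(dim, num_levels)

    def forward(self, object_queries, level_feats):
        # object_queries: (B, Q, C); level_feats: (B, L, N, C)
        B = object_queries.size(0)
        see = self.see_query.expand(B, -1, -1)                     # (B, 1, C)
        see, _ = self.cross_attn(see, object_queries, object_queries)
        w = self.level_logits(see).softmax(dim=-1).squeeze(1)      # (B, L)
        # Weighted sum over VGGT layers -> aggregated geometric features.
        return torch.einsum('blnc,bl->bnc', level_feats, w)        # (B, N, C)


# Hypothetical usage with made-up sizes:
qd = QueryDrivenAggregation(dim=256, num_levels=4)
level_feats = torch.randn(2, 4, 1024, 256)   # B=2, L=4 VGGT layers, N=1024 tokens
queries = torch.randn(2, 100, 256)           # 100 object queries
out = qd(queries, level_feats)               # (2, 1024, 256)
```

In this reading, AG replaces random or grid query initialization with token features selected by the model's own attention, and QD lets the detection head adaptively choose which VGGT depth (earlier, more 2D-like vs. later, more 3D-aware features) to draw from per scene.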