VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection
March 1, 2026
Authors: Yang Cao, Feize Wu, Dave Zhenyu Chen, Yingji Zhong, Lanqing Hong, Dan Xu
cs.AI
Abstract
Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain (i.e., precisely calibrated multi-view camera poses) to fuse multi-view information into a global scene representation, which limits their deployment in real-world scenes. We target a more practical setting: Sensor-Geometry-Free (SG-Free) multi-view indoor 3D object detection, where no sensor-provided geometric inputs (multi-view poses or depth) are available. The recent Visual Geometry Grounded Transformer (VGGT) shows that strong 3D cues can be inferred directly from images. Building on this insight, we present VGGT-Det, the first framework tailored for SG-Free multi-view indoor 3D object detection. Rather than merely consuming VGGT predictions, our method integrates the VGGT encoder into a transformer-based pipeline. To effectively leverage both the semantic and geometric priors inside VGGT, we introduce two key components: (i) Attention-Guided Query Generation (AG), which exploits VGGT attention maps as semantic priors to initialize object queries, improving localization by focusing on object regions while preserving global spatial structure; and (ii) Query-Driven Feature Aggregation (QD), in which a learnable See-Query interacts with object queries to 'see' what they need and then dynamically aggregates multi-level geometric features across the VGGT layers that progressively lift 2D features into 3D. Experiments show that VGGT-Det surpasses the best-performing prior method in the SG-Free setting by 4.4 and 8.6 mAP@0.25 points on ScanNet and ARKitScenes, respectively. Ablation studies show that our AG and QD effectively leverage VGGT's internally learned semantic and geometric priors.
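The abstract does not give implementation details, but the two components can be illustrated with a minimal NumPy sketch. All shapes, the random stand-ins for VGGT tokens and attention maps, the top-K query seeding, and the `W_level` projection are hypothetical assumptions for illustration only, not the paper's actual method:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, C, K, L = 64, 32, 4, 3  # tokens, channels, object queries, VGGT levels (assumed)

tokens = rng.standard_normal((T, C))    # stand-in for VGGT encoder tokens
attn = softmax(rng.standard_normal(T))  # stand-in for a VGGT attention map

# (i) Attention-Guided Query Generation (sketch): seed the K object queries
# from the most-attended tokens, so they start focused on likely object
# regions while the full token set keeps the global spatial structure.
object_queries = tokens[np.argsort(attn)[-K:]]  # (K, C)

# (ii) Query-Driven Feature Aggregation (sketch): a See-Query attends to the
# object queries ("sees" what they need), and the resulting summary produces
# per-level weights that blend multi-level features from the VGGT layers.
see_query = rng.standard_normal(C)                       # learnable in practice
need = softmax(object_queries @ see_query / np.sqrt(C))  # (K,) attention over queries
summary = need @ object_queries                          # (C,) pooled query needs

W_level = rng.standard_normal((L, C))                    # hypothetical projection
level_w = softmax(W_level @ summary)                     # (L,) per-level weights

level_feats = rng.standard_normal((L, T, C))             # features from L VGGT layers
aggregated = np.tensordot(level_w, level_feats, axes=1)  # (T, C) blended features
```

In a trained model the stand-in tensors would come from the frozen or fine-tuned VGGT encoder, and `see_query`/`W_level` would be learned jointly with the detection head; the sketch only shows the data flow of query seeding and weighted multi-level aggregation.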