VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection
March 1, 2026
Authors: Yang Cao, Feize Wu, Dave Zhenyu Chen, Yingji Zhong, Lanqing Hong, Dan Xu
cs.AI
Abstract
Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain (i.e., precisely calibrated multi-view camera poses) to fuse multi-view information into a global scene representation, which limits their deployment in real-world scenes. We target a more practical setting: Sensor-Geometry-Free (SG-Free) multi-view indoor 3D object detection, where no sensor-provided geometric inputs (multi-view poses or depth) are available. The recent Visual Geometry Grounded Transformer (VGGT) shows that strong 3D cues can be inferred directly from images. Building on this insight, we present VGGT-Det, the first framework tailored for SG-Free multi-view indoor 3D object detection. Rather than merely consuming VGGT predictions, our method integrates the VGGT encoder into a transformer-based pipeline. To effectively leverage both the semantic and geometric priors inside VGGT, we introduce two key components: (i) Attention-Guided Query Generation (AG), which exploits VGGT attention maps as semantic priors to initialize object queries, improving localization by focusing on object regions while preserving global spatial structure; and (ii) Query-Driven Feature Aggregation (QD), in which a learnable See-Query interacts with object queries to 'see' what they need and then dynamically aggregates multi-level geometric features across the VGGT layers that progressively lift 2D features into 3D. Experiments show that VGGT-Det surpasses the best-performing prior method in the SG-Free setting by 4.4 and 8.6 mAP@0.25 points on ScanNet and ARKitScenes, respectively. Ablation studies show that our AG and QD effectively leverage VGGT's internally learned semantic and geometric priors.
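The abstract does not give implementation details, but the two components can be illustrated with a minimal NumPy sketch. All shapes, the random stand-ins for VGGT tokens and attention maps, the top-K query seeding, and the `W_level` projection are hypothetical assumptions for illustration only, not the paper's actual method:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, C, K, L = 64, 32, 4, 3  # tokens, channels, object queries, VGGT levels (assumed)

tokens = rng.standard_normal((T, C))    # stand-in for VGGT encoder tokens
attn = softmax(rng.standard_normal(T))  # stand-in for a VGGT attention map

# (i) Attention-Guided Query Generation (sketch): seed the K object queries
# from the most-attended tokens, so they start focused on likely object
# regions while the full token set keeps the global spatial structure.
object_queries = tokens[np.argsort(attn)[-K:]]  # (K, C)

# (ii) Query-Driven Feature Aggregation (sketch): a See-Query attends to the
# object queries ("sees" what they need), and the resulting summary produces
# per-level weights that blend multi-level features from the VGGT layers.
see_query = rng.standard_normal(C)                       # learnable in practice
need = softmax(object_queries @ see_query / np.sqrt(C))  # (K,) attention over queries
summary = need @ object_queries                          # (C,) pooled query needs

W_level = rng.standard_normal((L, C))                    # hypothetical projection
level_w = softmax(W_level @ summary)                     # (L,) per-level weights

level_feats = rng.standard_normal((L, T, C))             # features from L VGGT layers
aggregated = np.tensordot(level_w, level_feats, axes=1)  # (T, C) blended features
```

In a trained model the stand-in tensors would come from the frozen or fine-tuned VGGT encoder, and `see_query`/`W_level` would be learned jointly with the detection head; the sketch only shows the data flow of query seeding and weighted multi-level aggregation.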