VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection

Abstract

Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain (i.e., precisely calibrated multi-view camera poses) to fuse multi-view information into a global scene representation, which limits their deployment in real-world scenes. We target a more practical setting: Sensor-Geometry-Free (SG-Free) multi-view indoor 3D object detection, in which no sensor-provided geometric inputs (multi-view poses or depth) are available. The recent Visual Geometry Grounded Transformer (VGGT) shows that strong 3D cues can be inferred directly from images. Building on this insight, we present VGGT-Det, the first framework tailored for SG-Free multi-view indoor 3D object detection. Rather than merely consuming VGGT predictions, our method integrates the VGGT encoder into a transformer-based pipeline. To effectively leverage the semantic and geometric priors inside VGGT, we introduce two novel key components: (i) Attention-Guided Query Generation (AG), which exploits VGGT attention maps as semantic priors to initialize object queries, improving localization by focusing on object regions while preserving global spatial structure; and (ii) Query-Driven Feature Aggregation (QD), in which a learnable See-Query interacts with object queries to "see" what they need and then dynamically aggregates multi-level geometric features across VGGT layers that progressively lift 2D features into 3D. Experiments show that VGGT-Det significantly surpasses the best-performing method in the SG-Free setting, by 4.4 and 8.6 mAP@0.25 on ScanNet and ARKitScenes, respectively. Ablation studies show that VGGT's internally learned semantic and geometric priors can be effectively leveraged by our AG and QD.
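
The two components named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the exact operations, so everything below is an assumption made for illustration. The function names (`init_queries_from_attention`, `aggregate_layers`), the `uniform_mix` parameter, the use of weighted sampling to pick query locations, and the mean-pooled layer summaries are all hypothetical. The first function biases query initialization toward high-attention tokens while keeping some uniform coverage; the second lets a See-Query summarize the object queries and then weights encoder layers by how well they match that summary.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    exp = np.exp(x - x.max(axis=axis, keepdims=True))
    return exp / exp.sum(axis=axis, keepdims=True)


def init_queries_from_attention(positions, attn, num_queries, uniform_mix=0.2, seed=0):
    """Sample query positions, biased toward high-attention tokens.

    positions: (N, 3) hypothetical 3D location per token.
    attn:      (N,) hypothetical positive attention saliency per token.
    uniform_mix blends in a uniform distribution so that some queries still
    land outside salient regions (an assumed way to keep global coverage).
    """
    n = positions.shape[0]
    probs = (1.0 - uniform_mix) * attn / attn.sum() + uniform_mix / n
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=num_queries, replace=False, p=probs)
    return positions[idx], idx


def aggregate_layers(layer_feats, object_queries, see_query):
    """Fuse per-layer features with weights conditioned on the object queries.

    layer_feats:    (L, N, D) hypothetical features from L encoder layers.
    object_queries: (Q, D) current object queries.
    see_query:      (D,) stand-in for a learnable See-Query vector.
    """
    dim = object_queries.shape[1]
    # The See-Query attends over the object queries to summarize their needs.
    query_weights = softmax(object_queries @ see_query / np.sqrt(dim))  # (Q,)
    need = query_weights @ object_queries                               # (D,)
    # Score each layer by matching its mean-pooled feature against that summary.
    layer_summary = layer_feats.mean(axis=1)                            # (L, D)
    layer_weights = softmax(layer_summary @ need / np.sqrt(dim))        # (L,)
    # Weighted sum over the layer axis gives one fused feature per token.
    fused = np.tensordot(layer_weights, layer_feats, axes=1)            # (N, D)
    return fused, layer_weights


# Toy usage with random data standing in for real encoder outputs.
rng = np.random.default_rng(0)
positions = rng.normal(size=(64, 3))
attn = rng.random(64)
query_pos, _ = init_queries_from_attention(positions, attn, num_queries=8)

layer_feats = rng.normal(size=(4, 64, 16))
object_queries = rng.normal(size=(8, 16))
see_query = rng.normal(size=16)
fused, layer_weights = aggregate_layers(layer_feats, object_queries, see_query)
```

In a trained model the See-Query and any projections would be learned parameters and the aggregation would likely sit inside a transformer decoder; the fixed random arrays here only show the data flow and tensor shapes.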

