WildDet3D: Scaling Promptable 3D Detection in the Wild

Abstract

Understanding objects in 3D from a single image is a cornerstone of spatial intelligence. A key step toward this goal is monocular 3D object detection: recovering the extent, location, and orientation of objects from an input RGB image. To be practical in the open world, such a detector must generalize beyond closed-set categories, support diverse prompt modalities, and leverage geometric cues when they are available. Progress is hampered by two bottlenecks. First, existing methods are designed for a single prompt type and lack a mechanism to incorporate additional geometric cues. Second, current 3D datasets cover only narrow categories in controlled environments, which limits open-world transfer. In this work we address both gaps. First, we introduce WildDet3D, a unified geometry-aware architecture that natively accepts text, point, and box prompts and can incorporate auxiliary depth signals at inference time. Second, we present WildDet3D-Data, the largest open 3D detection dataset to date, constructed by generating candidate 3D boxes from existing 2D annotations and retaining only human-verified ones, yielding over 1M images across 13.5K categories in diverse real-world scenes. WildDet3D establishes a new state of the art across multiple benchmarks and settings. In the open-world setting, it achieves 22.6/24.8 AP3D on our newly introduced WildDet3D-Bench with text and box prompts, respectively. On Omni3D, it reaches 34.2/36.4 AP3D with text and box prompts, respectively. In zero-shot evaluation, it achieves 40.3/48.9 ODS on Argoverse 2 and ScanNet, respectively. Notably, incorporating depth cues at inference time yields substantial additional gains (+20.7 AP on average across settings).
