WildDet3D: Scaling Promptable 3D Detection in the Wild
April 9, 2026
Authors: Weikai Huang, Jieyu Zhang, Sijun Li, Taoyang Jia, Jiafei Duan, Yunqian Cheng, Jaemin Cho, Mattew Wallingford, Rustin Soraki, Chris Dongjoo Kim, Donovan Clay, Taira Anderson, Winson Han, Ali Farhadi, Bharath Hariharan, Zhongzheng Ren, Ranjay Krishna
cs.AI
Abstract
Understanding objects in 3D from a single image is a cornerstone of spatial intelligence. A key step toward this goal is monocular 3D object detection: recovering the extent, location, and orientation of objects from an input RGB image. To be practical in the open world, such a detector must generalize beyond closed-set categories, support diverse prompt modalities, and leverage geometric cues when available. Progress is hampered by two bottlenecks: existing methods are designed for a single prompt type and lack a mechanism to incorporate additional geometric cues, while current 3D datasets cover only narrow categories in controlled environments, limiting open-world transfer. In this work we address both gaps. First, we introduce WildDet3D, a unified geometry-aware architecture that natively accepts text, point, and box prompts and can incorporate auxiliary depth signals at inference time. Second, we present WildDet3D-Data, the largest open 3D detection dataset to date, constructed by generating candidate 3D boxes from existing 2D annotations and retaining only human-verified ones, yielding over 1M images across 13.5K categories in diverse real-world scenes. WildDet3D establishes a new state of the art across multiple benchmarks and settings. In the open-world setting, it achieves 22.6/24.8 AP3D on our newly introduced WildDet3D-Bench with text and box prompts, respectively. On Omni3D, it reaches 34.2/36.4 AP3D with text and box prompts, respectively. In zero-shot evaluation, it achieves 40.3/48.9 ODS on Argoverse 2 and ScanNet, respectively. Notably, incorporating depth cues at inference time yields substantial additional gains (+20.7 AP on average across settings).
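To make the detection target concrete, the following is a minimal sketch of what "extent, location, and orientation" can look like as a data structure, and how such a box relates to the image through a pinhole camera. It is an illustration only and not the WildDet3D output format: the names `Box3D`, `box_corners`, and `project_pinhole` are hypothetical, the camera frame convention (x right, y down, z forward) is an assumption, and orientation is simplified to a single yaw angle even though a general 3D box may carry a full 3-DoF rotation.

```python
import math
from dataclasses import dataclass


@dataclass
class Box3D:
    """Hypothetical 3D box in camera coordinates (x right, y down, z forward), in metres."""
    center: tuple  # location: (x, y, z) of the box centre
    size: tuple    # extent: (width along x, height along y, length along z)
    yaw: float     # orientation: rotation about the camera y-axis, in radians


def box_corners(box):
    """Return the 8 corners of the box as (x, y, z) tuples in camera coordinates."""
    cx, cy, cz = box.center
    w, h, l = box.size
    c, s = math.cos(box.yaw), math.sin(box.yaw)
    corners = []
    for sx in (-0.5, 0.5):
        for sy in (-0.5, 0.5):
            for sz in (-0.5, 0.5):
                # Corner offset in the box's own frame.
                x, y, z = sx * w, sy * h, sz * l
                # Rotate the offset about the y-axis, then translate to the centre.
                xr = c * x + s * z
                zr = -s * x + c * z
                corners.append((cx + xr, cy + y, cz + zr))
    return corners


def project_pinhole(point, fx, fy, u0, v0):
    """Project a camera-frame 3D point to pixel coordinates with a pinhole model."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (fx * x / z + u0, fy * y / z + v0)


if __name__ == "__main__":
    # Assumed example values: a 2 m x 2 m x 4 m box centred 10 m in front of the camera.
    box = Box3D(center=(0.0, 0.0, 10.0), size=(2.0, 2.0, 4.0), yaw=0.0)
    for corner in box_corners(box):
        print(project_pinhole(corner, fx=1000.0, fy=1000.0, u0=640.0, v0=360.0))
```

The projection divides by depth `z`, so a single image constrains where a box lies along a viewing ray far more weakly than it constrains the box's image footprint. That is one intuition for why an auxiliary depth signal, when available, can help localize the box.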