WildDet3D: Scaling Promptable 3D Detection in the Wild

Abstract

Understanding objects in 3D from a single image is a cornerstone of spatial intelligence. A key step toward this goal is monocular 3D object detection: recovering the extent, location, and orientation of objects from an input RGB image. To be practical in the open world, such a detector must generalize beyond closed-set categories, support diverse prompt modalities, and leverage geometric cues when available. Progress is hampered by two bottlenecks. First, existing methods are designed for a single prompt type and lack a mechanism to incorporate additional geometric cues. Second, current 3D datasets cover only narrow categories in controlled environments, which limits open-world transfer. In this work we address both gaps. First, we introduce WildDet3D, a unified geometry-aware architecture that natively accepts text, point, and box prompts and can incorporate auxiliary depth signals at inference time. Second, we present WildDet3D-Data, the largest open 3D detection dataset to date, constructed by generating candidate 3D boxes from existing 2D annotations and retaining only human-verified ones, yielding over 1M images across 13.5K categories in diverse real-world scenes. WildDet3D establishes a new state of the art across multiple benchmarks and settings. In the open-world setting, it achieves 22.6/24.8 AP3D on our newly introduced WildDet3D-Bench with text and box prompts, respectively. On Omni3D, it reaches 34.2/36.4 AP3D with text and box prompts, respectively. In zero-shot evaluation, it achieves 40.3/48.9 ODS on Argoverse 2 and ScanNet. Notably, incorporating depth cues at inference time yields substantial additional gains (+20.7 AP on average across settings).
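To make concrete what "extent, location, and orientation" mean as detector outputs, here is a minimal sketch of one common 3D box parameterization and its projection into the image. It is an illustration only and is not taken from WildDet3D: the `Box3D` class, the yaw-only rotation about the camera's vertical axis, and the pinhole intrinsics `fx`, `fy`, `u0`, `v0` are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Box3D:
    """Hypothetical 3D box in camera coordinates (not the paper's format)."""
    center: tuple[float, float, float]  # location (x, y, z), z pointing forward
    size: tuple[float, float, float]    # extent (width on x, height on y, length on z)
    yaw: float                          # orientation: rotation about the y axis, radians


def box_corners(box: Box3D) -> list[tuple[float, float, float]]:
    """Return the 8 corners of the box in camera coordinates."""
    cx, cy, cz = box.center
    w, h, l = box.size
    c, s = math.cos(box.yaw), math.sin(box.yaw)
    corners = []
    for dx in (-w / 2, w / 2):
        for dy in (-h / 2, h / 2):
            for dz in (-l / 2, l / 2):
                # Rotate the local offset about the y axis, then translate.
                x = c * dx + s * dz
                z = -s * dx + c * dz
                corners.append((cx + x, cy + dy, cz + z))
    return corners


def project(point, fx, fy, u0, v0):
    """Pinhole projection of a camera-space point (z > 0) to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + u0, fy * y / z + v0)


# A 2 m x 2 m x 4 m box, 10 m in front of the camera, with no rotation.
box = Box3D(center=(0.0, 0.0, 10.0), size=(2.0, 2.0, 4.0), yaw=0.0)
corners = box_corners(box)
pixels = [project(p, fx=800.0, fy=800.0, u0=320.0, v0=240.0) for p in corners]
```

Under these assumptions, a monocular detector has to estimate all seven numbers (three for location, three for extent, one for yaw) from the RGB image alone. The projection step shows why depth is ambiguous without additional cues: scaling the box's center and size by the same factor leaves the projected pixels unchanged.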
