WildDet3D: Scaling Promptable 3D Detection in the Wild

Abstract

Understanding objects in 3D from a single image is a cornerstone of spatial intelligence. A key step toward this goal is monocular 3D object detection: recovering the extent, location, and orientation of objects from an input RGB image. To be practical in the open world, such a detector must generalize beyond closed-set categories, support diverse prompt modalities, and leverage geometric cues when available. Progress is hampered by two bottlenecks. First, existing methods are designed for a single prompt type and lack a mechanism to incorporate additional geometric cues. Second, current 3D datasets cover only narrow categories in controlled environments, which limits open-world transfer. In this work we address both gaps. First, we introduce WildDet3D, a unified geometry-aware architecture that natively accepts text, point, and box prompts and can incorporate auxiliary depth signals at inference time. Second, we present WildDet3D-Data, the largest open 3D detection dataset to date, constructed by generating candidate 3D boxes from existing 2D annotations and retaining only human-verified ones, yielding over 1M images across 13.5K categories in diverse real-world scenes. WildDet3D establishes a new state of the art across multiple benchmarks and settings. In the open-world setting, it achieves 22.6/24.8 AP3D with text and box prompts, respectively, on our newly introduced WildDet3D-Bench. On Omni3D, it reaches 34.2/36.4 AP3D with text and box prompts, respectively. In zero-shot evaluation, it achieves 40.3/48.9 ODS on Argoverse 2 and ScanNet, respectively. Notably, incorporating depth cues at inference time yields substantial additional gains (+20.7 AP on average across settings).