
WildDet3D: Scaling Promptable 3D Detection in the Wild

April 9, 2026
Authors: Weikai Huang, Jieyu Zhang, Sijun Li, Taoyang Jia, Jiafei Duan, Yunqian Cheng, Jaemin Cho, Mattew Wallingford, Rustin Soraki, Chris Dongjoo Kim, Donovan Clay, Taira Anderson, Winson Han, Ali Farhadi, Bharath Hariharan, Zhongzheng Ren, Ranjay Krishna
cs.AI

Abstract

Understanding objects in 3D from a single image is a cornerstone of spatial intelligence. A key step toward this goal is monocular 3D object detection: recovering the extent, location, and orientation of objects from an input RGB image. To be practical in the open world, such a detector must generalize beyond closed-set categories, support diverse prompt modalities, and leverage geometric cues when they are available. Progress is hampered by two bottlenecks. First, existing methods are designed for a single prompt type and lack a mechanism to incorporate additional geometric cues. Second, current 3D datasets cover only narrow categories in controlled environments, which limits open-world transfer. In this work we address both gaps. We introduce WildDet3D, a unified geometry-aware architecture that natively accepts text, point, and box prompts and can incorporate auxiliary depth signals at inference time. We also present WildDet3D-Data, the largest open 3D detection dataset to date, constructed by generating candidate 3D boxes from existing 2D annotations and retaining only human-verified ones, yielding over 1M images across 13.5K categories in diverse real-world scenes. WildDet3D establishes a new state of the art across multiple benchmarks and settings. In the open-world setting, it achieves 22.6/24.8 AP3D on our newly introduced WildDet3D-Bench with text and box prompts, respectively. On Omni3D, it reaches 34.2/36.4 AP3D with text and box prompts, respectively. In zero-shot evaluation, it achieves 40.3/48.9 ODS on Argoverse 2 and ScanNet, respectively. Notably, incorporating depth cues at inference time yields substantial additional gains (+20.7 AP on average across settings).
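To make "extent, location, and orientation" concrete, here is a minimal sketch of how one 3D box might be parameterized and projected into an image. It is an illustration only, not the paper's data format or code: the `Box3D` class, the yaw-only rotation (a simplification of a full 3D orientation), the axis convention, and the pinhole intrinsics are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Box3D:
    # Hypothetical container for illustration; not the paper's format.
    center: tuple  # location (x, y, z) in camera coordinates, z pointing forward
    size: tuple    # extent (width, height, length) along the box's own x, y, z axes
    yaw: float     # orientation: rotation about the camera's y axis, in radians


def box_corners(box):
    """Return the 8 corners of the box in camera coordinates."""
    cx, cy, cz = box.center
    w, h, l = box.size
    c, s = math.cos(box.yaw), math.sin(box.yaw)
    corners = []
    for dx in (-w / 2, w / 2):
        for dy in (-h / 2, h / 2):
            for dz in (-l / 2, l / 2):
                # Rotate the local offset about the y axis, then translate.
                x = c * dx + s * dz
                z = -s * dx + c * dz
                corners.append((cx + x, cy + dy, cz + z))
    return corners


def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-space point to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


# Assumed example values: a 2 x 1.5 x 4 box, 10 units in front of the camera.
box = Box3D(center=(0.0, 0.0, 10.0), size=(2.0, 1.5, 4.0), yaw=0.0)
corners = box_corners(box)
pixels = [project(p, fx=1000.0, fy=1000.0, cx=640.0, cy=360.0) for p in corners]
```

A monocular detector has to infer `center`, `size`, and the orientation from pixels alone, which is why an auxiliary depth signal can help: it constrains the `z` component that a single RGB image leaves ambiguous.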