3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features
November 7, 2023
Authors: Chenfeng Xu, Huan Ling, Sanja Fidler, Or Litany
cs.AI
Abstract
We present 3DiffTection, a state-of-the-art method for 3D object detection from single images, leveraging features from a 3D-aware diffusion model. Annotating large-scale image data for 3D detection is resource-intensive and time-consuming. Recently, pretrained large image diffusion models have become prominent as effective feature extractors for 2D perception tasks. However, these features are initially trained on paired text and image data, which are not optimized for 3D tasks, and often exhibit a domain gap when applied to the target data. Our approach bridges these gaps through two specialized tuning strategies: geometric and semantic. For geometric tuning, we fine-tune a diffusion model to perform novel view synthesis conditioned on a single image by introducing a novel epipolar warp operator.
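To make the geometric tuning step concrete, here is a minimal PyTorch sketch of one standard way to realize an epipolar warp: every target-view pixel samples source-view features at a set of depth hypotheses along its epipolar line and aggregates them. The function name, the shared-intrinsics assumption, and the mean aggregation are illustrative choices, not the paper's exact operator.

```python
import torch
import torch.nn.functional as F


def epipolar_warp(src_feat, K, R, t, depths, out_hw):
    """Sample source-view features along epipolar lines via depth hypotheses.

    src_feat: (B, C, Hs, Ws) source-view feature map
    K:        (B, 3, 3) camera intrinsics (assumed shared by both views)
    R, t:     (B, 3, 3) and (B, 3, 1) pose of the source camera relative
              to the target camera, i.e. x_src ~ K (R X + t)
    depths:   (D,) depth hypotheses along each target ray
    out_hw:   (H, W) target feature resolution
    Returns a (B, C, H, W) warped feature map.
    """
    B, C, Hs, Ws = src_feat.shape
    H, W = out_hw
    D = depths.numel()
    device = src_feat.device
    depths = depths.to(device)

    # Homogeneous pixel grid of the target view: (3, H*W).
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).reshape(3, -1)

    rays = torch.inverse(K) @ pix                         # (B, 3, H*W)
    # Back-project each pixel at every depth, then map into the source view.
    pts = rays.unsqueeze(1) * depths.view(1, D, 1, 1)     # (B, D, 3, H*W)
    pts = R.unsqueeze(1) @ pts + t.unsqueeze(1)
    uv = K.unsqueeze(1) @ pts
    # Naively clamp depths <= 0 (points behind the source camera).
    uv = uv[:, :, :2] / uv[:, :, 2:3].clamp(min=1e-6)     # (B, D, 2, H*W)

    # Normalize to [-1, 1] and bilinearly sample, folding D into the batch.
    gx = uv[:, :, 0] / (Ws - 1) * 2 - 1
    gy = uv[:, :, 1] / (Hs - 1) * 2 - 1
    grid = torch.stack([gx, gy], dim=-1).view(B * D, H, W, 2)
    feat = src_feat.unsqueeze(1).expand(B, D, C, Hs, Ws).reshape(B * D, C, Hs, Ws)
    sampled = F.grid_sample(feat, grid, align_corners=True)

    # Mean over depth hypotheses; a learned aggregation would go here.
    return sampled.view(B, D, C, H, W).mean(dim=1)
```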
This task meets two essential criteria: the necessity for 3D awareness, and reliance solely on posed image data, which are readily available (e.g., from videos) and do not require manual annotation. For semantic refinement, we further train the model on target data with detection supervision. Both tuning phases employ ControlNet to preserve the integrity of the original feature capabilities.
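The ControlNet mechanism invoked here can be summarized in a few lines: a frozen pretrained block is paired with a trainable copy whose output is injected through a zero-initialized convolution, so the wrapped model starts as an exact identity over the pretrained features and tuning cannot erase them. The sketch below illustrates only this principle; the class names are hypothetical and the paper's architecture may differ in its details.

```python
import copy
import torch.nn as nn


class ZeroConv(nn.Module):
    """1x1 convolution initialized to zero (the ControlNet coupling layer)."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x):
        return self.conv(x)


class ControlledBlock(nn.Module):
    """Frozen pretrained block plus a trainable copy coupled by a zero conv.

    At initialization the zero conv outputs zeros, so forward() reproduces
    the pretrained block exactly; gradients still flow to the copy.
    """

    def __init__(self, pretrained_block: nn.Module, channels: int):
        super().__init__()
        self.frozen = pretrained_block
        for p in self.frozen.parameters():
            p.requires_grad_(False)
        self.trainable = copy.deepcopy(pretrained_block)
        self.zero_conv = ZeroConv(channels)

    def forward(self, x, control=None):
        # The conditioning signal (e.g., warped features) enters the copy only.
        branch_in = x if control is None else x + control
        return self.frozen(x) + self.zero_conv(self.trainable(branch_in))
```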
In the final step, we harness these enhanced capabilities to conduct a test-time prediction ensemble across multiple virtual viewpoints.
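Such a test-time ensemble can be sketched as follows: render features at a few virtual viewpoints (here, small yaw rotations of the camera), run the detection head on each, map the predictions back to the original camera frame, and fuse them. The callables, the yaw-only perturbations, and the score-sort fusion below are illustrative stand-ins, assuming detections carry a 3D center and a confidence score.

```python
import math
import torch


def yaw_rotation(deg: float) -> torch.Tensor:
    """3x3 rotation about the camera's vertical (yaw) axis."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return torch.tensor([[c, 0.0, s],
                         [0.0, 1.0, 0.0],
                         [-s, 0.0, c]])


@torch.no_grad()
def ensemble_detections(feat, K, warp_fn, detect_fn, yaw_offsets=(-3.0, 0.0, 3.0)):
    """Aggregate 3D detections over virtual viewpoints (illustrative sketch).

    feat:      (1, C, H, W) 3D-aware feature map of the input image
    K:         (3, 3) camera intrinsics
    warp_fn:   renders features at a virtual view given a relative rotation
               (stand-in for the geometrically tuned backbone)
    detect_fn: maps a feature map to a list of dicts with keys
               'center' (a (3,) tensor) and 'score' (a float)
    Returns all detections expressed in the original camera frame.
    """
    fused = []
    for deg in yaw_offsets:
        R = yaw_rotation(deg)
        virt_feat = feat if deg == 0.0 else warp_fn(feat, K, R)
        for det in detect_fn(virt_feat):
            det = dict(det)
            # Points transform as x_virt = R @ x_orig, so invert with R^T.
            # A full pipeline would also rotate the box orientation.
            det["center"] = R.T @ det["center"]
            fused.append(det)
    # Rank by confidence; a real system would apply 3D NMS at this point.
    return sorted(fused, key=lambda d: d["score"], reverse=True)
```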
Through our methodology, we obtain 3D-aware features that are tailored for 3D detection and excel in identifying cross-view point correspondences. Consequently, our model emerges as a powerful 3D detector that substantially surpasses prior work: for example, it outperforms Cube-RCNN, a precedent in single-view 3D detection, by 9.43% AP3D on the Omni3D-ARkitscene dataset. Furthermore, 3DiffTection showcases robust data efficiency and generalization to cross-domain data.