

3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features

November 7, 2023
Authors: Chenfeng Xu, Huan Ling, Sanja Fidler, Or Litany
cs.AI

Abstract

We present 3DiffTection, a state-of-the-art method for 3D object detection from single images that leverages features from a 3D-aware diffusion model. Annotating large-scale image data for 3D detection is resource-intensive and time-consuming. Recently, pretrained large image diffusion models have become prominent as effective feature extractors for 2D perception tasks. However, these features are initially learned from paired text and image data, which is not optimized for 3D tasks, and they often exhibit a domain gap when applied to the target data. Our approach bridges these gaps through two specialized tuning strategies: geometric and semantic. For geometric tuning, we fine-tune a diffusion model to perform novel-view synthesis conditioned on a single image, introducing a novel epipolar warp operator. This task meets two essential criteria: it requires 3D awareness, and it relies solely on posed image data, which are readily available (e.g., from videos) and do not require manual annotation. For semantic refinement, we further train the model on target data with detection supervision. Both tuning phases employ ControlNet to preserve the integrity of the original feature capabilities. In the final step, we harness these enhanced capabilities to conduct a test-time prediction ensemble across multiple virtual viewpoints. Through our methodology, we obtain 3D-aware features that are tailored for 3D detection and excel at identifying cross-view point correspondences. Consequently, our model emerges as a powerful 3D detector that substantially surpasses previous baselines: for example, it outperforms Cube-RCNN, a precedent in single-view 3D detection, by 9.43% in AP3D on the Omni3D-ARkitscene dataset. Furthermore, 3DiffTection showcases robust data efficiency and generalization to cross-domain data.
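The geometric tuning stage hinges on an epipolar warp operator that carries source-view features toward a target view using the relative camera pose. The sketch below is a minimal, illustrative PyTorch rendition of that idea, not the paper's implementation: the function names (`skew`, `fundamental_matrix`, `warp_features_epipolar`), the pinhole-camera assumption, the `num_samples` parameter, and the plain averaging along each epipolar line are assumptions made for clarity; the actual operator conditions a ControlNet branch and would use a learned aggregation.

```python
# Illustrative sketch of epipolar feature warping (NOT the paper's code).
# Assumes pinhole cameras; averaging stands in for learned aggregation.
import torch
import torch.nn.functional as F


def skew(t):
    """3x3 skew-symmetric matrix [t]_x built from translation vector t."""
    t0, t1, t2 = float(t[0]), float(t[1]), float(t[2])
    return torch.tensor([[0.0, -t2, t1],
                         [t2, 0.0, -t0],
                         [-t1, t0, 0.0]])


def fundamental_matrix(K_src, K_tgt, R, t):
    """F mapping target pixels to epipolar lines in the source view,
    given relative rotation R and translation t (target -> source)."""
    E = skew(t) @ R  # essential matrix
    return torch.linalg.inv(K_src).T @ E @ torch.linalg.inv(K_tgt)


def warp_features_epipolar(feat_src, F_mat, H, W, num_samples=32):
    """For each target pixel, sample source features (1, C, H, W) along
    its epipolar line and average them into a target-aligned map."""
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).float()
    lines = pix.reshape(-1, 3) @ F_mat.T           # per-pixel line ax+by+c=0
    a, b, c = lines[:, 0:1], lines[:, 1:2], lines[:, 2:3]
    x = torch.linspace(0, W - 1, num_samples)      # sample x across the image
    y = -(a * x + c) / (b + 1e-8)                  # y on each line, (HW, S)
    grid = torch.stack([2 * x.expand_as(y) / (W - 1) - 1,   # to [-1, 1]
                        2 * y / (H - 1) - 1], dim=-1)        # (HW, S, 2)
    sampled = F.grid_sample(feat_src, grid.unsqueeze(0),
                            align_corners=True)    # (1, C, HW, S)
    return sampled.mean(dim=-1).reshape(1, -1, H, W)
```

As a usage illustration under the same assumptions: with identity intrinsics `K = torch.eye(3)` and a small relative pose `(R, t)`, `warp_features_epipolar(feat, fundamental_matrix(K, K, R, t), H, W)` returns a source feature map resampled along target-view epipolar lines; in the paper's pipeline, such pose-conditioned warping is what makes the same backbone features reusable for the test-time ensemble over virtual viewpoints.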