3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features

Abstract

We present 3DiffTection, a state-of-the-art method for 3D object detection from single images that leverages features from a 3D-aware diffusion model. Annotating large-scale image data for 3D detection is resource-intensive and time-consuming. Recently, large pretrained image diffusion models have become prominent as effective feature extractors for 2D perception tasks. However, these models are initially trained on paired text and image data, so their features are not optimized for 3D tasks and often exhibit a domain gap when applied to the target data. Our approach bridges these gaps through two specialized tuning strategies: geometric and semantic. For geometric tuning, we introduce a novel epipolar warp operator and fine-tune a diffusion model to perform novel view synthesis conditioned on a single image. This task meets two essential criteria: it requires 3D awareness, and it relies solely on posed image data, which is readily available (e.g., from videos) and does not require manual annotation. For semantic refinement, we further train the model on target data with detection supervision. Both tuning phases employ ControlNet to preserve the integrity of the original feature capabilities. In the final step, we harness these enhanced capabilities to run a test-time prediction ensemble across multiple virtual viewpoints. Through this methodology, we obtain 3D-aware features that are tailored for 3D detection and excel at identifying cross-view point correspondences. Consequently, our model emerges as a powerful 3D detector, substantially surpassing previous benchmarks; for example, it outperforms Cube-RCNN, a precedent in single-view 3D detection, by 9.43% in AP3D on the Omni3D-ARkitscene dataset. Furthermore, 3DiffTection demonstrates robust data efficiency and generalization to cross-domain data.
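The abstract does not spell out how the epipolar warp operator is defined, so the following is not the authors' operator. It is only a minimal sketch of the standard two-view epipolar constraint that such an operator would presumably build on: given the relative pose between a source and a target view, each source pixel maps to a line in the target image along which its correspondence must lie. All names here (`skew`, `fundamental_matrix`, `epipolar_line`, `project`) and the camera values are hypothetical, chosen for illustration, and the pose convention `X_tgt = R @ X_src + t` is an assumption.

```python
import numpy as np


def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v equals np.cross(t, v)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz, ty],
                     [tz, 0.0, -tx],
                     [-ty, tx, 0.0]])


def fundamental_matrix(K_src, K_tgt, R, t):
    """Fundamental matrix F for the assumed convention X_tgt = R @ X_src + t.

    Satisfies p_tgt^T F p_src = 0 for corresponding homogeneous pixels.
    """
    E = skew(t) @ R  # essential matrix
    return np.linalg.inv(K_tgt).T @ E @ np.linalg.inv(K_src)


def epipolar_line(F, pixel_src):
    """Line (a, b, c) in the target image with a*u + b*v + c = 0."""
    u, v = pixel_src
    return F @ np.array([u, v, 1.0])


def project(K, X):
    """Pinhole projection of a 3D point in camera coordinates to pixels."""
    x = K @ X
    return x[:2] / x[2]


# Hypothetical cameras: shared intrinsics, 10 degree rotation about the y-axis.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
theta = np.deg2rad(10.0)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.2, 0.0, 0.05])

# A 3D point seen in both views.
X_src = np.array([0.3, -0.1, 2.0])
X_tgt = R @ X_src + t
p_src = project(K, X_src)
p_tgt = project(K, X_tgt)

# The target pixel should lie on the epipolar line of the source pixel.
F = fundamental_matrix(K, K, R, t)
line = epipolar_line(F, p_src)
residual = abs(line @ np.append(p_tgt, 1.0)) / np.hypot(line[0], line[1])
```

Here `residual` is the pixel distance from the true target projection to the epipolar line, which should be zero up to floating-point error. A warp operator in this spirit could gather source-view features for each target pixel along such lines, but that design is an assumption and is not stated in the abstract.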
