3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features

Abstract

We present 3DiffTection, a state-of-the-art method for 3D object detection from single images that leverages features from a 3D-aware diffusion model. Annotating large-scale image data for 3D detection is resource-intensive and time-consuming. Recently, large pretrained image diffusion models have become prominent as effective feature extractors for 2D perception tasks. However, these models are initially trained on paired text and image data, so their features are not optimized for 3D tasks and often exhibit a domain gap when applied to the target data. Our approach bridges these gaps through two specialized tuning strategies: geometric and semantic. For geometric tuning, we fine-tune a diffusion model to perform novel view synthesis conditioned on a single image by introducing a novel epipolar warp operator. This task meets two essential criteria: it requires 3D awareness, and it relies solely on posed image data, which is readily available (e.g., from videos) and does not require manual annotation. For semantic refinement, we further train the model on target data with detection supervision. Both tuning phases employ ControlNet to preserve the integrity of the original feature capabilities. In the final step, we use these enhanced capabilities to ensemble test-time predictions across multiple virtual viewpoints. Through this methodology, we obtain 3D-aware features that are tailored for 3D detection and excel at identifying cross-view point correspondences. Consequently, our model emerges as a powerful 3D detector, substantially surpassing previous methods: it outperforms Cube-RCNN, a precedent in single-view 3D detection, by 9.43% in AP3D on the Omni3D-ARkitscene dataset. Furthermore, 3DiffTection demonstrates strong data efficiency and generalization to cross-domain data.
