MVDiffusion++: A Dense High-resolution Multi-view Diffusion Model for Single or Sparse-view 3D Object Reconstruction

Abstract

This paper presents MVDiffusion++, a neural architecture for 3D object reconstruction that synthesizes dense, high-resolution views of an object given one or a few images without camera poses. MVDiffusion++ achieves superior flexibility and scalability with two surprisingly simple ideas: 1) a "pose-free architecture", in which standard self-attention among 2D latent features learns 3D consistency across an arbitrary number of conditional and generation views without explicitly using camera pose information; and 2) a "view dropout strategy" that discards a substantial number of output views during training, which reduces the training-time memory footprint and enables dense, high-resolution view synthesis at test time. We train on Objaverse and evaluate on Google Scanned Objects using standard novel view synthesis and 3D reconstruction metrics, where MVDiffusion++ significantly outperforms the current state of the art. We also demonstrate a text-to-3D application by combining MVDiffusion++ with a text-to-image generative model.
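
The two ideas named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the identity query/key/value projections, the token sizes, and the numbers 32 and 8 are all assumptions chosen for illustration. The sketch shows (a) view dropout as sampling a random subset of output views for one training step, and (b) pose-free attention as ordinary self-attention over the tokens of all views concatenated into a single sequence, with no camera pose entering the computation.

```python
import math
import random


def sample_training_views(num_output_views, keep, rng):
    """View dropout (illustrative): keep a random subset of output-view
    indices for one training step and discard the rest."""
    if not 0 < keep <= num_output_views:
        raise ValueError("keep must be in 1..num_output_views")
    return sorted(rng.sample(range(num_output_views), keep))


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(views):
    """Standard self-attention over the tokens of all views at once.

    `views` is a list of views; each view is a list of token vectors.
    No camera pose is used anywhere. Identity projections stand in for
    the learned Q, K, V maps of a real layer (an assumption made here
    to keep the sketch short).
    """
    tokens = [tok for view in views for tok in view]  # flatten all views
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    mixed = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        mixed.append(
            [sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)]
        )
    # Restore the per-view grouping of the output tokens.
    result, start = [], 0
    for view in views:
        result.append(mixed[start:start + len(view)])
        start += len(view)
    return result


rng = random.Random(0)
# Hypothetical sizes: 32 output views in total, 8 kept in this training step.
kept = sample_training_views(num_output_views=32, keep=8, rng=rng)
# One conditional view plus the kept output views, 3 tokens of width 4 each.
views = [
    [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]
    for _ in range(1 + len(kept))
]
mixed = joint_self_attention(views)
```

Because the attention function takes any number of views, the same code could process a small subset of views during training and the full dense set at test time, which is the property the view dropout strategy relies on.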
