AutoDecoding Latent 3D Diffusion Models

Abstract

We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core. The 3D autodecoder framework embeds properties learned from the target dataset in the latent space, which can then be decoded into a volumetric representation for rendering view-consistent appearance and geometry. We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations to learn a 3D diffusion from 2D images or monocular videos of rigid or articulated objects. Our approach is flexible enough to use either existing camera supervision or no camera information at all, in which case it is efficiently learned during training. Our evaluations demonstrate that our generation results outperform state-of-the-art alternatives on various benchmark datasets and metrics, including multi-view image datasets of synthetic objects, real in-the-wild videos of moving people, and a large-scale, real video dataset of static objects.
