ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Real Image

Abstract

We introduce a 3D-aware diffusion model, ZeroNVS, for single-image novel view synthesis for in-the-wild scenes. While existing methods are designed for single objects with masked backgrounds, we propose new techniques to address challenges introduced by in-the-wild multi-object scenes with complex backgrounds. Specifically, we train a generative prior on a mixture of data sources that capture object-centric, indoor, and outdoor scenes. To address issues from data mixture such as depth-scale ambiguity, we propose a novel camera conditioning parameterization and normalization scheme. Further, we observe that Score Distillation Sampling (SDS) tends to truncate the distribution of complex backgrounds during distillation of 360-degree scenes, and propose "SDS anchoring" to improve the diversity of synthesized novel views. Our model sets a new state-of-the-art result in LPIPS on the DTU dataset in the zero-shot setting, even outperforming methods specifically trained on DTU. We further adapt the challenging Mip-NeRF 360 dataset as a new benchmark for single-image novel view synthesis, and demonstrate strong performance in this setting. Our code and data are at http://kylesargent.github.io/zeronvs/

