DORSal: Diffusion for Object-centric Representations of Scenes

Abstract

Recent progress in 3D scene understanding enables scalable learning of representations across large datasets of diverse scenes. As a consequence, generalization to unseen scenes and objects, rendering novel views from just a single or a handful of input images, and controllable scene generation that supports editing are now possible. However, training jointly on a large number of scenes typically compromises rendering quality when compared to single-scene optimized models such as NeRFs. In this paper, we leverage recent progress in diffusion models to equip 3D scene representation learning models with the ability to render high-fidelity novel views, while largely retaining benefits such as object-level scene editing. In particular, we propose DORSal, which adapts a video diffusion architecture for 3D scene generation conditioned on object-centric slot-based representations of scenes. On both complex synthetic multi-object scenes and the real-world large-scale Street View dataset, we show that DORSal enables scalable neural rendering of 3D scenes with object-level editing and improves upon existing approaches.
