DORSal: Diffusion for Object-centric Representations of Scenes et al.

Abstract

Recent progress in 3D scene understanding enables scalable learning of representations across large datasets of diverse scenes. As a consequence, generalization to unseen scenes and objects, rendering novel views from just a single or a handful of input images, and controllable scene generation that supports editing are now possible. However, training jointly on a large number of scenes typically compromises rendering quality when compared to single-scene optimized models such as NeRFs. In this paper, we leverage recent progress in diffusion models to equip 3D scene representation learning models with the ability to render high-fidelity novel views, while retaining benefits such as object-level scene editing to a large degree. In particular, we propose DORSal, which adapts a video diffusion architecture for 3D scene generation conditioned on object-centric slot-based representations of scenes. On both complex synthetic multi-object scenes and the real-world large-scale Street View dataset, we show that DORSal enables scalable neural rendering of 3D scenes with object-level editing and improves upon existing approaches.
