DORSal: Diffusion for Object-centric Representations of Scenes et al.

Abstract

Recent progress in 3D scene understanding enables scalable learning of representations across large datasets of diverse scenes. As a consequence, generalization to unseen scenes and objects, rendering novel views from just a single or a handful of input images, and controllable scene generation that supports editing are now possible. However, training jointly on a large number of scenes typically compromises rendering quality when compared to single-scene optimized models such as NeRFs. In this paper, we leverage recent progress in diffusion models to equip 3D scene representation learning models with the ability to render high-fidelity novel views, while retaining benefits such as object-level scene editing to a large degree. In particular, we propose DORSal, which adapts a video diffusion architecture for 3D scene generation conditioned on object-centric slot-based representations of scenes. On both complex synthetic multi-object scenes and the real-world large-scale Street View dataset, we show that DORSal enables scalable neural rendering of 3D scenes with object-level editing and improves upon existing approaches.
