Disentangled 3D Scene Generation with Layout Learning

Abstract

We introduce a method to generate 3D scenes that are disentangled into their component objects. This disentanglement is unsupervised, relying only on the knowledge of a large pretrained text-to-image model. Our key insight is that objects can be discovered by finding parts of a 3D scene that, when rearranged spatially, still produce valid configurations of the same scene. Concretely, our method jointly optimizes multiple NeRFs from scratch, each representing its own object, along with a set of layouts that composite these objects into scenes. We then encourage these composited scenes to be in-distribution according to the image generator. We show that despite its simplicity, our approach successfully generates 3D scenes decomposed into individual objects, enabling new capabilities in text-to-3D content creation. For results and an interactive demo, see our project page: https://dave.ml/layoutlearning/
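
To make the idea of compositing objects under several layouts concrete, here is a minimal sketch. It is not the paper's implementation. It assumes that a layout is one similarity transform (rotation about the z-axis, translation, uniform scale) per object, and that a composited scene's density is the sum of the object densities. Neither detail is stated in the abstract. Toy Gaussian blobs stand in for NeRFs, and the optimization against the text-to-image model is omitted entirely. All names (`make_blob`, `to_object_frame`, `composite_density`) are hypothetical.

```python
import numpy as np

def rotation_z(theta):
    """3x3 rotation matrix about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def make_blob(radius):
    """Toy stand-in for a NeRF density field: a Gaussian blob at the origin."""
    def density(points):
        return np.exp(-np.sum(points ** 2, axis=-1) / (2.0 * radius ** 2))
    return density

def to_object_frame(points, theta, translation, scale):
    """Map scene-space points (rows) into an object's canonical frame.

    Assumed placement: x_scene = scale * R @ x_obj + t,
    so the inverse is x_obj = R^T @ (x_scene - t) / scale.
    For row vectors, R^T @ v is written as v @ R.
    """
    R = rotation_z(theta)
    return (points - translation) @ R / scale

def composite_density(points, objects, layout):
    """Sum the densities of all objects after placing them with one layout."""
    total = np.zeros(points.shape[:-1])
    for density, (theta, translation, scale) in zip(objects, layout):
        local = to_object_frame(points, theta, np.asarray(translation), scale)
        total += density(local)
    return total

# Two objects shared across two layouts (assumed values, for illustration only).
objects = [make_blob(0.3), make_blob(0.2)]
layouts = [
    [(0.0, (-1.0, 0.0, 0.0), 1.0), (0.0, (1.0, 0.0, 0.0), 1.0)],
    [(np.pi / 2, (0.0, 1.0, 0.0), 1.0), (0.0, (0.0, -1.0, 0.0), 2.0)],
]

# Evaluate each composited scene at two query points.
query = np.array([[-1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
densities = [composite_density(query, objects, layout) for layout in layouts]
```

The same object fields are reused in every layout, and only the per-object transforms differ. In the method described above, both the object representations and the layouts would be optimized jointly, whereas this sketch holds them fixed.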
