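Disentangled 3D Scene Generation with Layout Learning

Abstract

We introduce a method to generate 3D scenes that are disentangled into their component objects. This disentanglement is unsupervised, relying only on the knowledge of a large pretrained text-to-image model. Our key insight is that objects can be discovered by finding parts of a 3D scene that, when rearranged spatially, still produce valid configurations of the same scene. Concretely, our method jointly optimizes multiple NeRFs from scratch, each representing its own object, along with a set of layouts that composite these objects into scenes. We then encourage these composited scenes to be in-distribution according to the image generator. We show that despite its simplicity, our approach successfully generates 3D scenes decomposed into individual objects, enabling new capabilities in text-to-3D content creation. For results and an interactive demo, see our project page at https://dave.ml/layoutlearning/.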
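To make the compositing idea concrete, the following is a minimal sketch of how several per-object density fields could be combined under different layouts. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how a layout is parameterized, so the `Pose` class (yaw rotation, translation, uniform scale), the `blob` stand-in for a NeRF, and the additive `composite_density` rule are all hypothetical. The optimization loop and the image-generator loss are omitted.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[float, float, float]
DensityFn = Callable[[Point], float]


@dataclass
class Pose:
    """Hypothetical placement of one object: x_world = scale * R_z(yaw) @ x_obj + t."""
    yaw: float
    translation: Point
    scale: float = 1.0

    def world_to_object(self, p: Point) -> Point:
        # Undo the translation, then the rotation (transpose of R_z), then the scale.
        dx = p[0] - self.translation[0]
        dy = p[1] - self.translation[1]
        dz = p[2] - self.translation[2]
        c, s = math.cos(self.yaw), math.sin(self.yaw)
        return (
            (c * dx + s * dy) / self.scale,
            (-s * dx + c * dy) / self.scale,
            dz / self.scale,
        )


def blob(radius: float) -> DensityFn:
    """Toy stand-in for a per-object NeRF: a Gaussian density centered at the origin."""
    def density(p: Point) -> float:
        r2 = p[0] ** 2 + p[1] ** 2 + p[2] ** 2
        return math.exp(-r2 / (2.0 * radius ** 2))
    return density


def composite_density(objects: List[DensityFn], layout: List[Pose], p: Point) -> float:
    """Query each object in its own canonical frame and sum the densities (an assumed rule)."""
    return sum(obj(pose.world_to_object(p)) for obj, pose in zip(objects, layout))


# Two objects shared by two layouts: the objects stay fixed, only their poses change.
objects = [blob(0.2), blob(0.2)]
layout_a = [Pose(0.0, (1.0, 0.0, 0.0)), Pose(0.0, (-1.0, 0.0, 0.0))]
layout_b = [Pose(math.pi / 2, (0.0, 1.0, 0.0)), Pose(0.0, (0.0, -1.0, 0.0))]

density_a = composite_density(objects, layout_a, (1.0, 0.0, 0.0))
density_b = composite_density(objects, layout_b, (0.0, 1.0, 0.0))
```

Here the same two objects appear in both layouts, and only their poses differ. In the method described above, both the object representations and the layouts would be learned jointly, with renderings of each composited scene scored by the pretrained image generator.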

