MaGRITTe: Manipulative and Generative 3D Realization from Image, Topview and Text
March 30, 2024
Authors: Takayuki Hara, Tatsuya Harada
cs.AI
Abstract
The generation of 3D scenes from user-specified conditions offers a promising
avenue for alleviating the production burden in 3D applications. Previous
studies required significant effort to realize the desired scene, owing to
limited control conditions. We propose a method for controlling and generating
3D scenes under multimodal conditions using partial images, layout information
represented in the top view, and text prompts. Combining these conditions to
generate a 3D scene involves the following significant difficulties: (1) the
creation of large datasets, (2) reflection on the interaction of multimodal
conditions, and (3) domain dependence of the layout conditions. We decompose
the process of 3D scene generation into 2D image generation from the given
conditions and 3D scene generation from 2D images. 2D image generation is
achieved by fine-tuning a pretrained text-to-image model with a small
artificial dataset of partial images and layouts, and 3D scene generation is
achieved by layout-conditioned depth estimation and neural radiance fields
(NeRF), thereby avoiding the creation of large datasets. The use of a common
representation of spatial information using 360-degree images allows for the
consideration of multimodal condition interactions and reduces the domain
dependence of the layout control. The experimental results qualitatively and
quantitatively demonstrated that the proposed method can generate 3D scenes in
diverse domains, from indoor to outdoor, according to multimodal conditions.
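
To make the role of the 360-degree image as a shared spatial representation more concrete, here is a minimal sketch of the standard equirectangular projection, which maps a 3D point around the camera to a pixel in a 360-degree image. It is an illustration only, not the authors' implementation: the function name `point_to_equirect`, the axis convention (x right, y forward, z up, camera at the origin), and the image size are all assumptions made for this example.

```python
import math


def point_to_equirect(x, y, z, width, height):
    """Project a 3D point onto an equirectangular (360-degree) image.

    Assumed convention (hypothetical, chosen for this sketch):
    the camera sits at the origin, x points right, y points forward,
    z points up. Returns fractional pixel coordinates (u, v), with
    u increasing to the right and v increasing downward.
    """
    # Azimuth in [-pi, pi]; 0 means straight ahead.
    lon = math.atan2(x, y)
    # Elevation in [-pi/2, pi/2]; 0 means the horizon.
    lat = math.atan2(z, math.hypot(x, y))
    # The horizontal axis spans the full 360 degrees of azimuth.
    u = (lon / (2.0 * math.pi) + 0.5) * width
    # The vertical axis spans 180 degrees of elevation; the top row is straight up.
    v = (0.5 - lat / math.pi) * height
    return u, v


if __name__ == "__main__":
    # A floor point 2 m ahead of a camera mounted 1.6 m above the floor.
    # In a top view this is just a 2D position; in the 360-degree image
    # it lands below the horizon row.
    print(point_to_equirect(0.0, 2.0, -1.6, 2048, 1024))
```

Under these assumptions, a position drawn in a top-view layout, once given a height, can be placed in the same pixel grid as a partial image, which is one way to read the abstract's claim that a common 360-degree representation lets the conditions interact.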