ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models
March 4, 2024
Authors: Lukas Höllein, Aljaž Božič, Norman Müller, David Novotny, Hung-Yu Tseng, Christian Richardt, Michael Zollhöfer, Matthias Nießner
cs.AI
Abstract
3D asset generation is getting massive amounts of attention, inspired by the
recent success of text-guided 2D content creation. Existing text-to-3D methods
use pretrained text-to-image diffusion models in an optimization problem or
fine-tune them on synthetic data, which often results in non-photorealistic 3D
objects without backgrounds. In this paper, we present a method that leverages
pretrained text-to-image models as a prior and learns to generate multi-view
images in a single denoising process from real-world data. Concretely, we
propose to integrate 3D volume-rendering and cross-frame-attention layers into
each block of the existing U-Net network of the text-to-image model. Moreover,
we design an autoregressive generation scheme that renders more 3D-consistent
images at any viewpoint. We train our model on real-world datasets of objects and
showcase its capabilities to generate instances with a variety of high-quality
shapes and textures in authentic surroundings. Compared to existing
methods, the results generated by our method are 3D-consistent and have favorable
visual quality (-30% FID, -37% KID).
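
Below is a minimal, illustrative sketch (not the authors' released code): it assumes a simplified residual U-Net block and shows how a cross-frame-attention layer could let the N views of one scene exchange information during joint denoising, in the spirit of the augmentation described in the abstract. The 3D volume-rendering layer is omitted, and all class and function names here are hypothetical.

```python
# Hypothetical sketch of a U-Net block augmented with cross-frame attention.
# Not the ViewDiff implementation; shapes and names are illustrative only.
import torch
import torch.nn as nn


class CrossFrameAttention(nn.Module):
    """Attention over the tokens of all N views of a scene jointly."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.GroupNorm(32, channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, num_views: int) -> torch.Tensor:
        # x: (B * N, C, H, W), where N = num_views generated per scene.
        bn, c, h, w = x.shape
        b = bn // num_views
        feats = self.norm(x)
        # Flatten spatial dims and concatenate the tokens of all views of a
        # scene, so attention can mix information across frames.
        tokens = (
            feats.reshape(b, num_views, c, h, w)
            .permute(0, 1, 3, 4, 2)                # (B, N, H, W, C)
            .reshape(b, num_views * h * w, c)      # (B, N*H*W, C)
        )
        mixed, _ = self.attn(tokens, tokens, tokens)
        mixed = (
            mixed.reshape(b, num_views, h, w, c)
            .permute(0, 1, 4, 2, 3)
            .reshape(bn, c, h, w)
        )
        return x + mixed  # residual: the pretrained 2D pathway stays intact


class AugmentedUNetBlock(nn.Module):
    """A plain conv block followed by the cross-frame-attention layer."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.GroupNorm(32, channels),
            nn.SiLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.cross_frame = CrossFrameAttention(channels)

    def forward(self, x: torch.Tensor, num_views: int) -> torch.Tensor:
        x = x + self.conv(x)
        return self.cross_frame(x, num_views)


if __name__ == "__main__":
    views = 4                                   # N views of one scene, denoised jointly
    feats = torch.randn(views, 64, 16, 16)      # (B*N, C, H, W) with B = 1
    block = AugmentedUNetBlock(64)
    print(block(feats, num_views=views).shape)  # torch.Size([4, 64, 16, 16])
```

Keeping the added layers residual means the pretrained text-to-image weights remain usable as-is, which is one plausible way to exploit the 2D prior while learning multi-view consistency from real-world data.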