ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models
March 4, 2024
Authors: Lukas Höllein, Aljaž Božič, Norman Müller, David Novotny, Hung-Yu Tseng, Christian Richardt, Michael Zollhöfer, Matthias Nießner
cs.AI
Abstract
3D asset generation is getting massive amounts of attention, inspired by the
recent success of text-guided 2D content creation. Existing text-to-3D methods
use pretrained text-to-image diffusion models in an optimization problem or
fine-tune them on synthetic data, which often results in non-photorealistic 3D
objects without backgrounds. In this paper, we present a method that leverages
pretrained text-to-image models as a prior and learns to generate multi-view
images in a single denoising process from real-world data. Concretely, we
propose to integrate 3D volume-rendering and cross-frame-attention layers into
each block of the existing U-Net network of the text-to-image model. Moreover,
we design an autoregressive generation scheme that renders more 3D-consistent
images at any viewpoint. We train our model on real-world datasets of objects and
showcase its capabilities to generate instances with a variety of high-quality
shapes and textures in authentic surroundings. Compared to existing
methods, the results generated by our method are 3D-consistent and have favorable
visual quality (-30% FID, -37% KID).
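
Below is a minimal, illustrative sketch (not the authors' released code): it assumes a simplified residual U-Net block and shows how a cross-frame-attention layer could let the N views of one scene exchange information during joint denoising, in the spirit of the augmentation described in the abstract. The 3D volume-rendering layer is omitted, and all class and function names here are hypothetical.

```python
# Hypothetical sketch of a U-Net block augmented with cross-frame attention.
# Not the ViewDiff implementation; shapes and names are illustrative only.
import torch
import torch.nn as nn


class CrossFrameAttention(nn.Module):
    """Attention over the tokens of all N views of a scene jointly."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.GroupNorm(32, channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, num_views: int) -> torch.Tensor:
        # x: (B * N, C, H, W), where N = num_views generated per scene.
        bn, c, h, w = x.shape
        b = bn // num_views
        feats = self.norm(x)
        # Flatten spatial dims and concatenate the tokens of all views of a
        # scene, so attention can mix information across frames.
        tokens = (
            feats.reshape(b, num_views, c, h, w)
            .permute(0, 1, 3, 4, 2)                # (B, N, H, W, C)
            .reshape(b, num_views * h * w, c)      # (B, N*H*W, C)
        )
        mixed, _ = self.attn(tokens, tokens, tokens)
        mixed = (
            mixed.reshape(b, num_views, h, w, c)
            .permute(0, 1, 4, 2, 3)
            .reshape(bn, c, h, w)
        )
        return x + mixed  # residual: the pretrained 2D pathway stays intact


class AugmentedUNetBlock(nn.Module):
    """A plain conv block followed by the cross-frame-attention layer."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.GroupNorm(32, channels),
            nn.SiLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.cross_frame = CrossFrameAttention(channels)

    def forward(self, x: torch.Tensor, num_views: int) -> torch.Tensor:
        x = x + self.conv(x)
        return self.cross_frame(x, num_views)


if __name__ == "__main__":
    views = 4                                   # N views of one scene, denoised jointly
    feats = torch.randn(views, 64, 16, 16)      # (B*N, C, H, W) with B = 1
    block = AugmentedUNetBlock(64)
    print(block(feats, num_views=views).shape)  # torch.Size([4, 64, 16, 16])
```

Keeping the added layers residual means the pretrained text-to-image weights remain usable as-is, which is one plausible way to exploit the 2D prior while learning multi-view consistency from real-world data.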