
X-Dreamer: Creating High-quality 3D Content by Bridging the Domain Gap Between Text-to-2D and Text-to-3D Generation

November 30, 2023
Authors: Yiwei Ma, Yijun Fan, Jiayi Ji, Haowei Wang, Xiaoshuai Sun, Guannan Jiang, Annan Shu, Rongrong Ji
cs.AI

Abstract

Automatic text-to-3D content creation has recently made significant progress, driven by the development of pretrained 2D diffusion models. Existing text-to-3D methods typically optimize a 3D representation so that its rendered images align well with the given text, as evaluated by a pretrained 2D diffusion model. However, a substantial domain gap exists between 2D images and 3D assets, primarily attributed to variations in camera-related attributes and the exclusive presence of foreground objects. Consequently, directly employing 2D diffusion models to optimize 3D representations may lead to suboptimal results. To address this issue, we present X-Dreamer, a novel approach for high-quality text-to-3D content creation that effectively bridges the gap between text-to-2D and text-to-3D synthesis. X-Dreamer has two key components: Camera-Guided Low-Rank Adaptation (CG-LoRA) and Attention-Mask Alignment (AMA) Loss. CG-LoRA dynamically incorporates camera information into the pretrained diffusion model by generating its trainable parameters in a camera-dependent way. This integration improves the alignment between the generated 3D assets and the camera's perspective. The AMA loss guides the attention map of the pretrained diffusion model with the binary mask of the 3D object, prioritizing the creation of the foreground object. This ensures that the model focuses on generating accurate and detailed foreground objects. Extensive evaluations demonstrate the effectiveness of the proposed method compared to existing text-to-3D approaches. Project webpage: https://xmuxiaoma666.github.io/Projects/X-Dreamer.
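
The two components described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not give the details, so the sketch assumes that CG-LoRA derives its low-rank factors A and B from a camera vector through linear projections, and that the AMA loss is a mean squared error between a max-normalized attention map and the binary mask. All function names, shapes, and the loss form are hypothetical.

```python
# Illustrative sketch only: names, shapes, and the loss form are assumptions,
# not taken from the X-Dreamer paper or its code.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def camera_to_factor(camera, proj, rows, cols):
    """Map a camera vector to a (rows x cols) factor via a linear projection.

    proj has shape (len(camera) x rows*cols); assumed form of the generator.
    """
    flat = matmul([camera], proj)[0]
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

def cg_lora_weight(w_frozen, camera, proj_a, proj_b, rank):
    """Return W + B @ A, where A and B are generated from the camera vector.

    w_frozen stays fixed; only proj_a and proj_b would be trained.
    """
    d_out, d_in = len(w_frozen), len(w_frozen[0])
    a = camera_to_factor(camera, proj_a, rank, d_in)    # rank x d_in
    b = camera_to_factor(camera, proj_b, d_out, rank)   # d_out x rank
    delta = matmul(b, a)                                # d_out x d_in
    return [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w_frozen, delta)]

def ama_loss(attention, mask):
    """Mean squared error between a max-normalized attention map and a mask."""
    peak = max(max(row) for row in attention) or 1.0    # avoid division by zero
    diffs = [(a / peak - m) ** 2
             for a_row, m_row in zip(attention, mask)
             for a, m in zip(a_row, m_row)]
    return sum(diffs) / len(diffs)
```

In this sketch, a different camera vector yields a different weight update for the same frozen layer, and the loss reaches zero only when the attention concentrates exactly on the masked foreground pixels.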