X-Dreamer: Creating High-quality 3D Content by Bridging the Domain Gap Between Text-to-2D and Text-to-3D Generation
November 30, 2023
Authors: Yiwei Ma, Yijun Fan, Jiayi Ji, Haowei Wang, Xiaoshuai Sun, Guannan Jiang, Annan Shu, Rongrong Ji
cs.AI
Abstract
In recent times, automatic text-to-3D content creation has made significant
progress, driven by the development of pretrained 2D diffusion models. Existing
text-to-3D methods typically optimize the 3D representation to ensure that the
rendered image aligns well with the given text, as evaluated by the pretrained
2D diffusion model. Nevertheless, a substantial domain gap exists between 2D
images and 3D assets, primarily attributed to variations in camera-related
attributes and the exclusive presence of foreground objects. Consequently,
employing 2D diffusion models directly for optimizing 3D representations may
lead to suboptimal outcomes. To address this issue, we present X-Dreamer, a
novel approach for high-quality text-to-3D content creation that effectively
bridges the gap between text-to-2D and text-to-3D synthesis. The key components
of X-Dreamer are two innovative designs: Camera-Guided Low-Rank Adaptation
(CG-LoRA) and Attention-Mask Alignment (AMA) Loss. CG-LoRA dynamically
incorporates camera information into the pretrained diffusion models by
employing camera-dependent generation for trainable parameters. This
integration enhances the alignment between the generated 3D assets and the
camera's perspective. AMA loss guides the attention map of the pretrained
diffusion model using the binary mask of the 3D object, prioritizing the
creation of the foreground object. This module ensures that the model focuses
on generating accurate and detailed foreground objects. Extensive evaluations
demonstrate the effectiveness of our proposed method compared to existing
text-to-3D approaches. Our project webpage:
https://xmuxiaoma666.github.io/Projects/X-Dreamer.
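
The abstract describes two mechanisms at a level where a small sketch may help: CG-LoRA generates part of a low-rank weight update from camera information, and the AMA loss pulls a cross-attention map toward the rendered object's binary foreground mask. Below is a minimal PyTorch sketch of both ideas; the module names (`CameraGuidedLoRALinear`, `ama_loss`), the tensor shapes, and the azimuth/elevation/distance/FOV camera parameterization are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the CG-LoRA and AMA-loss ideas described in the abstract.
# Shapes, module names, and the camera parameterization are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CameraGuidedLoRALinear(nn.Module):
    """Linear layer whose low-rank update is generated from camera info (CG-LoRA idea)."""

    def __init__(self, in_dim, out_dim, rank=4, cam_dim=4):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)  # stands in for a frozen pretrained projection
        for p in self.base.parameters():
            p.requires_grad_(False)
        # One LoRA factor is predicted from the camera embedding;
        # the other is a shared trainable matrix, initialized to zero.
        self.cam_to_A = nn.Linear(cam_dim, in_dim * rank)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.rank = rank

    def forward(self, x, camera):
        # x: (batch, tokens, in_dim); camera: (batch, cam_dim), e.g. azimuth,
        # elevation, distance, fov (an assumed parameterization).
        A = self.cam_to_A(camera).view(-1, x.shape[-1], self.rank)  # (batch, in_dim, rank)
        lora_out = torch.einsum("bti,bir,ro->bto", x, A, self.B)
        return self.base(x) + lora_out


def ama_loss(attn_map, fg_mask):
    """Attention-Mask Alignment: align a cross-attention map with the binary
    foreground mask rendered from the 3D representation."""
    # attn_map: (batch, H, W) attention over image tokens for the object token(s)
    # fg_mask:  (batch, H, W) binary foreground mask of the rendered 3D object
    attn = attn_map / (attn_map.amax(dim=(1, 2), keepdim=True) + 1e-8)
    return F.l1_loss(attn, fg_mask)


if __name__ == "__main__":
    layer = CameraGuidedLoRALinear(in_dim=64, out_dim=64, rank=4, cam_dim=4)
    x = torch.randn(2, 16, 64)
    cam = torch.randn(2, 4)
    print(layer(x, cam).shape)  # torch.Size([2, 16, 64])
    mask = (torch.rand(2, 32, 32) > 0.5).float()
    print(ama_loss(torch.rand(2, 32, 32), mask))
```

Generating one LoRA factor from the camera embedding keeps the added parameter count small while making the frozen diffusion prior view-aware, which matches the intuition the abstract gives for improving alignment between the generated 3D asset and the camera's perspective; the AMA term similarly encodes the abstract's point that optimization should prioritize the foreground object.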