X-Dreamer: Creating High-quality 3D Content by Bridging the Domain Gap Between Text-to-2D and Text-to-3D Generation

Abstract

Automatic text-to-3D content creation has recently made significant progress, driven by the development of pretrained 2D diffusion models. Existing text-to-3D methods typically optimize a 3D representation so that its rendered images align well with the given text, as evaluated by a pretrained 2D diffusion model. However, a substantial domain gap exists between 2D images and 3D assets, primarily attributed to variations in camera-related attributes and the exclusive presence of foreground objects in 3D assets. Consequently, using 2D diffusion models directly to optimize 3D representations may lead to suboptimal results. To address this issue, we present X-Dreamer, a novel approach for high-quality text-to-3D content creation that effectively bridges the gap between text-to-2D and text-to-3D synthesis. X-Dreamer is built on two key designs: Camera-Guided Low-Rank Adaptation (CG-LoRA) and the Attention-Mask Alignment (AMA) loss. CG-LoRA dynamically incorporates camera information into the pretrained diffusion model by generating its trainable parameters in a camera-dependent manner. This improves the alignment between the generated 3D assets and the camera's perspective. The AMA loss guides the attention map of the pretrained diffusion model with the binary mask of the 3D object, prioritizing the creation of the foreground object. This ensures that the model focuses on generating accurate and detailed foreground objects. Extensive evaluations demonstrate the effectiveness of the proposed method compared to existing text-to-3D approaches. Project webpage: https://xmuxiaoma666.github.io/Projects/X-Dreamer
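To make the two components more concrete, the following is a minimal, dependency-free sketch and not the authors' implementation. It assumes that CG-LoRA produces the low-rank factors A and B of a standard LoRA update as linear functions of a camera feature vector, and that the AMA loss is a binary cross-entropy between an attention map scaled to [0, 1] and the rendered binary mask. The abstract specifies neither detail, so the function names (`camera_guided_lora`, `ama_loss`), the generator matrices `gen_a` and `gen_b`, and all shapes are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def camera_guided_lora(weight, camera, gen_a, gen_b, rank):
    """Return weight + B @ A, with A and B generated from the camera vector.

    Assumption: each low-rank factor is a linear function of the camera
    features. gen_a has shape (rank * d_in, cam_dim) and gen_b has shape
    (d_out * rank, cam_dim); both are hypothetical trainable generators.
    """
    d_out, d_in = len(weight), len(weight[0])
    cam_col = [[c] for c in camera]                        # cam_dim x 1
    flat_a = [row[0] for row in matmul(gen_a, cam_col)]    # rank * d_in values
    flat_b = [row[0] for row in matmul(gen_b, cam_col)]    # d_out * rank values
    a = [flat_a[i * d_in:(i + 1) * d_in] for i in range(rank)]   # rank x d_in
    b = [flat_b[i * rank:(i + 1) * rank] for i in range(d_out)]  # d_out x rank
    delta = matmul(b, a)                                   # d_out x d_in update
    return [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


def ama_loss(attention, mask, eps=1e-7):
    """Mean binary cross-entropy between an attention map and a binary mask.

    Assumption: attention values are already scaled to [0, 1] and the
    mask has the same spatial size as the attention map.
    """
    total, count = 0.0, 0
    for att_row, mask_row in zip(attention, mask):
        for p, m in zip(att_row, mask_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= m * math.log(p) + (1 - m) * math.log(1.0 - p)
            count += 1
    return total / count
```

In this sketch the frozen pretrained weight is never modified in place. Only the generators would be trained, so the adapted weight changes with the camera vector, and the loss is small when attention concentrates inside the object mask.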
