HexaGen3D: StableDiffusion is just one step away from Fast and Diverse Text-to-3D Generation

Abstract

Despite recent remarkable advances in generative modeling, efficiently generating high-quality 3D assets from textual prompts remains a difficult task. A key challenge is data scarcity: the most extensive 3D datasets encompass merely millions of assets, while their 2D counterparts contain billions of text-image pairs. To address this, we propose a novel approach that harnesses the power of large, pretrained 2D diffusion models. More specifically, our approach, HexaGen3D, fine-tunes a pretrained text-to-image model to jointly predict 6 orthographic projections and the corresponding latent triplane. We then decode these latents to generate a textured mesh. HexaGen3D does not require per-sample optimization and can infer high-quality and diverse objects from textual prompts in 7 seconds, offering significantly better quality-to-latency trade-offs than existing approaches. Furthermore, HexaGen3D demonstrates strong generalization to new objects or compositions.
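For readers unfamiliar with the term, a triplane stores features on three axis-aligned 2D grids (XY, XZ, YZ), and a 3D point is queried by projecting it onto each plane and combining the three lookups. The toy sketch below illustrates only that general idea. It is not HexaGen3D's decoder: the helper names (`make_plane`, `to_index`, `query_triplane`), the nearest-neighbor lookup, and the choice to sum the three features are all assumptions made for illustration, since the abstract does not specify these details.

```python
# Illustrative toy triplane lookup. All names and design choices here are
# hypothetical; this is not the method described in the paper.

def make_plane(res, channels, fill):
    """Build a res x res grid where every cell holds `channels` copies of `fill`."""
    return [[[fill] * channels for _ in range(res)] for _ in range(res)]


def to_index(coord, res):
    """Clamp a coordinate to [-1, 1] and map it to the nearest grid index."""
    clamped = max(-1.0, min(1.0, coord))
    return round((clamped + 1.0) / 2.0 * (res - 1))


def query_triplane(planes, point):
    """Project `point` onto the XY, XZ and YZ planes and sum the three features.

    Summation and nearest-neighbor sampling are assumptions of this sketch;
    practical systems often interpolate bilinearly instead.
    """
    x, y, z = point
    res = len(planes["xy"])
    lookups = [
        planes["xy"][to_index(x, res)][to_index(y, res)],
        planes["xz"][to_index(x, res)][to_index(z, res)],
        planes["yz"][to_index(y, res)][to_index(z, res)],
    ]
    # Add the three feature vectors channel by channel.
    return [sum(values) for values in zip(*lookups)]


# Constant-valued planes make the result easy to check by hand.
planes = {
    "xy": make_plane(res=4, channels=2, fill=1.0),
    "xz": make_plane(res=4, channels=2, fill=2.0),
    "yz": make_plane(res=4, channels=2, fill=3.0),
}
feature = query_triplane(planes, (0.0, 0.5, -1.0))
print(feature)  # [6.0, 6.0]
```

Because each plane is a 2D feature map, a representation of this kind can in principle be produced by image-style networks, which is consistent with the abstract's motivation for reusing a pretrained 2D diffusion model.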
