Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior

Abstract

Recently, 3D content creation from text prompts has made remarkable progress by leveraging 2D and 3D diffusion models. While 3D diffusion models ensure strong multi-view consistency, their ability to generate high-quality and diverse 3D assets is limited by the scarcity of 3D data. In contrast, 2D diffusion models support a distillation approach that achieves excellent generalization and rich detail without any 3D data. However, 2D lifting methods suffer from inherent view-agnostic ambiguity, which leads to serious multi-face Janus issues: text prompts alone fail to provide sufficient guidance for learning coherent 3D results. Instead of retraining a costly viewpoint-aware model, we study how to fully exploit easily accessible coarse 3D knowledge to enhance the prompts and guide 2D lifting optimization for refinement. In this paper, we propose Sherpa3D, a new text-to-3D framework that simultaneously achieves high fidelity, generalizability, and geometric consistency. Specifically, we design a pair of guiding strategies derived from the coarse 3D prior generated by the 3D diffusion model: structural guidance for geometric fidelity and semantic guidance for 3D coherence. With these two types of guidance, the 2D diffusion model enriches the 3D content with diverse, high-quality results. Extensive experiments show that Sherpa3D outperforms state-of-the-art text-to-3D methods in quality and 3D consistency.
