Controllable Text-to-3D Generation via Surface-Aligned Gaussian Splatting

Abstract

Text-to-3D and image-to-3D generation have received considerable attention, but controllable text-to-3D generation, an important task that lies between them, remains under-explored. This work focuses on that task and makes two contributions. 1) We introduce Multi-view ControlNet (MVControl), a novel neural network architecture that enhances existing pre-trained multi-view diffusion models by integrating additional input conditions such as edge, depth, normal, and scribble maps. Our innovation lies in a conditioning module that controls the base diffusion model using both local and global embeddings, which are computed from the input condition images and camera poses. Once trained, MVControl can provide 3D diffusion guidance for optimization-based 3D generation. 2) We propose an efficient multi-stage 3D generation pipeline that leverages the benefits of recent large reconstruction models and score distillation algorithms. Building on the MVControl architecture, we employ a hybrid diffusion guidance method to direct the optimization process. For efficiency, we adopt 3D Gaussians as our representation instead of the commonly used implicit representations. We also pioneer the use of SuGaR, a hybrid representation that binds Gaussians to mesh triangle faces. This approach alleviates the poor geometry of 3D Gaussians and enables fine-grained geometry to be sculpted directly on the mesh. Extensive experiments demonstrate that our method generalizes robustly and enables the controllable generation of high-quality 3D content.
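
To make the idea of binding Gaussians to mesh triangle faces concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a Gaussian center is stored as fixed barycentric weights on one triangle, so that the center follows the surface whenever the mesh vertices move. The function names `bound_center` and `face_normal`, and the example triangle and weights, are hypothetical and chosen only for illustration.

```python
# Hypothetical sketch: a Gaussian center bound to a mesh triangle through
# fixed barycentric weights. Names and values are illustrative assumptions,
# not taken from the paper or from any SuGaR code.

def bound_center(triangle, weights):
    """Return the 3D point given by barycentric weights on a triangle.

    triangle: three (x, y, z) vertex tuples
    weights:  three non-negative floats that sum to 1
    """
    if abs(sum(weights) - 1.0) > 1e-9 or min(weights) < 0:
        raise ValueError("weights must be non-negative and sum to 1")
    return tuple(
        sum(w * v[axis] for w, v in zip(weights, triangle))
        for axis in range(3)
    )


def face_normal(triangle):
    """Unit normal of the triangle, usable as the thin axis of a flat Gaussian."""
    a, b, c = triangle
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    # Cross product u x v gives a vector perpendicular to the face.
    n = (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )
    length = sum(x * x for x in n) ** 0.5
    if length == 0:
        raise ValueError("degenerate triangle")
    return tuple(x / length for x in n)


triangle = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
weights = (0.5, 0.25, 0.25)
center = bound_center(triangle, weights)

# Moving a vertex moves the bound Gaussian with the surface,
# because the weights stay fixed while the vertices change.
moved = ((0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
moved_center = bound_center(moved, weights)
```

Under this assumption, editing the mesh geometry is enough to reposition every bound Gaussian, which is one way to read the abstract's claim that geometry can be sculpted directly on the mesh.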
