Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation

Abstract

Existing feed-forward image-to-3D methods mainly rely on 2D multi-view diffusion models that cannot guarantee 3D consistency. These methods easily collapse when the prompt view direction changes, and they mainly handle object-centric prompt images. In this paper, we propose a novel single-stage 3D diffusion model, DiffusionGS, for object and scene generation from a single view. DiffusionGS directly outputs 3D Gaussian point clouds at each timestep to enforce view consistency, which allows the model to generate robustly given prompt views of any direction, beyond object-centric inputs. In addition, to improve the capability and generalization ability of DiffusionGS, we scale up 3D training data by developing a scene-object mixed training strategy. Experiments show that our method achieves better generation quality (2.20 dB higher in PSNR and 23.25 lower in FID) and is over 5x faster (~6s on an A100 GPU) than SOTA methods. The user study and text-to-3D applications also reveal the practical value of our method. Our project page at https://caiyuanhao1998.github.io/project/DiffusionGS/ shows the video and interactive generation results.
