GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation

Abstract

While 3D content generation has advanced significantly, existing methods still face challenges with input formats, latent space design, and output representations. This paper introduces a novel 3D generation framework that addresses these challenges, offering scalable, high-quality 3D generation with an interactive Point Cloud-structured Latent space. Our framework employs a Variational Autoencoder (VAE) that takes multi-view posed RGB-D(epth)-N(ormal) renderings as input, uses a unique latent space design that preserves 3D shape information, and incorporates a cascaded latent diffusion model for improved shape-texture disentanglement. The proposed method, GaussianAnything, supports multi-modal conditional 3D generation, accepting point cloud, caption, and single- or multi-view image inputs. Notably, the newly proposed latent space naturally enables geometry-texture disentanglement, thus allowing 3D-aware editing. Experimental results demonstrate the effectiveness of our approach on multiple datasets, outperforming existing methods in both text- and image-conditioned 3D generation.
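To make the idea of a point cloud-structured latent with cascaded sampling more concrete, the following is a minimal illustrative sketch and not the authors' implementation. The names `PointCloudLatent`, `sample_geometry`, `sample_texture`, `cascaded_sample`, and `edit_geometry` are hypothetical, and both "diffusion stages" are replaced by simple random stand-ins. It assumes only that the latent can be viewed as a set of 3D positions (geometry) with a per-point feature vector (texture), so that geometry can be edited while the texture part is left untouched.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class PointCloudLatent:
    """Hypothetical container: N latent points, each with a position and a feature vector."""
    xyz: List[Vec3]               # geometry part of the latent, one 3D position per point
    features: List[List[float]]   # texture part of the latent, one feature vector per point


def sample_geometry(rng: random.Random, num_points: int) -> List[Vec3]:
    # Stand-in for a first (geometry) stage: plain Gaussian noise, not a trained model.
    return [(rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(num_points)]


def sample_texture(rng: random.Random, xyz: List[Vec3], feature_dim: int) -> List[List[float]]:
    # Stand-in for a second (texture) stage conditioned on the stage-1 positions.
    # The conditioning is faked: each feature is the coordinate sum plus small noise.
    return [[sum(p) + 0.1 * rng.gauss(0, 1) for _ in range(feature_dim)] for p in xyz]


def cascaded_sample(seed: int = 0, num_points: int = 8, feature_dim: int = 4) -> PointCloudLatent:
    # Cascade: geometry is sampled first, then texture is sampled given that geometry.
    rng = random.Random(seed)
    xyz = sample_geometry(rng, num_points)
    features = sample_texture(rng, xyz, feature_dim)
    return PointCloudLatent(xyz=xyz, features=features)


def edit_geometry(latent: PointCloudLatent, scale: Vec3) -> PointCloudLatent:
    # Geometry-only edit: rescale positions per axis and copy the features unchanged.
    new_xyz = [(x * scale[0], y * scale[1], z * scale[2]) for x, y, z in latent.xyz]
    return PointCloudLatent(xyz=new_xyz, features=[list(f) for f in latent.features])


latent = cascaded_sample()
edited = edit_geometry(latent, scale=(1.0, 2.0, 1.0))
print(len(latent.xyz), len(latent.features[0]))
print(edited.features == latent.features)
```

In this toy setup, stretching the shape along the y-axis changes only `xyz`; the per-point features are carried over as-is, which is the kind of separation that geometry-texture disentanglement would make possible in an actual model.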
