LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation

Abstract

Neural rendering has advanced significantly with progress in generative models and differentiable rendering techniques. Although 2D diffusion models have been successful, a unified 3D diffusion pipeline has yet to be established. This paper introduces LN3Diff, a novel framework that addresses this gap and enables fast, high-quality, and generic conditional 3D generation. Our approach uses a 3D-aware architecture and a variational autoencoder (VAE) to encode the input image into a structured, compact 3D latent space. A transformer-based decoder then decodes the latent into a high-capacity 3D neural field. By training a diffusion model on this 3D-aware latent space, our method achieves state-of-the-art performance on ShapeNet for 3D generation and demonstrates superior performance in monocular 3D reconstruction and conditional 3D generation across various datasets. Moreover, it surpasses existing 3D diffusion methods in inference speed and requires no per-instance optimization. LN3Diff represents a significant advance in 3D generative modeling and holds promise for a range of applications in 3D vision and graphics.
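
The abstract does not specify how the latent is sampled or how noise is applied during diffusion training, so the following is only a minimal sketch of the general latent-diffusion idea. It assumes a diagonal-Gaussian VAE posterior and a standard DDPM-style forward process with a linear beta schedule; neither assumption is confirmed by the text. All function names, the latent size, and the numeric values are hypothetical and chosen for illustration.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1) (VAE reparameterization)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def add_noise(z0, t, alpha_bars, rng):
    """Sample z_t from N(sqrt(a_bar_t) * z0, (1 - a_bar_t) * I); also return eps."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]
    return zt, eps


rng = random.Random(0)
mu = [0.5, -1.0, 0.25, 0.0]          # hypothetical encoder mean (toy 4-D latent)
log_var = [-2.0, -2.0, -2.0, -2.0]   # hypothetical encoder log-variance
z0 = reparameterize(mu, log_var, rng)

alpha_bars = alpha_bar_schedule(1000)
zt, eps = add_noise(z0, 500, alpha_bars, rng)
```

In a setup like this, a denoising network would be trained to predict `eps` from `zt` and `t`, and a separate decoder would map a denoised latent to the 3D neural field. Because the diffusion model operates on a compact latent rather than on the 3D representation directly, sampling needs no per-instance optimization, which is consistent with the speed claim in the abstract.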
