Direct3D-S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention

Abstract

Generating high-resolution 3D shapes using volumetric representations such as Signed Distance Functions (SDFs) presents substantial computational and memory challenges. We introduce Direct3D-S2, a scalable 3D generation framework based on sparse volumes that achieves superior output quality with dramatically reduced training costs. Our key innovation is the Spatial Sparse Attention (SSA) mechanism, which greatly improves the efficiency of Diffusion Transformer (DiT) computations on sparse volumetric data. SSA allows the model to process large token sets within sparse volumes effectively, achieving a 3.9x speedup in the forward pass and a 9.6x speedup in the backward pass. Our framework also includes a variational autoencoder (VAE) that maintains a consistent sparse volumetric format across the input, latent, and output stages. Compared to previous 3D VAEs that rely on heterogeneous representations, this unified design significantly improves training efficiency and stability. Our model is trained on publicly available datasets, and experiments demonstrate that Direct3D-S2 not only surpasses state-of-the-art methods in generation quality and efficiency, but also enables training at 1024³ resolution using only 8 GPUs. By comparison, volumetric representations at 256³ resolution typically require at least 32 GPUs. This makes gigascale 3D generation both practical and accessible. Project page: https://nju3dv.github.io/projects/Direct3D-S2/.
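To make the scale in the abstract concrete, the following short sketch works through the arithmetic behind "gigascale". It assumes that a resolution of N refers to a dense cubic grid of N x N x N cells, and that full self-attention scores every pair of tokens. The function names are illustrative only and are not taken from the Direct3D-S2 code.

```python
def dense_voxel_count(resolution: int) -> int:
    """Number of cells in a dense cubic grid with the given edge length."""
    return resolution ** 3


def full_attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by full (dense) self-attention over num_tokens."""
    return num_tokens ** 2


low = dense_voxel_count(256)    # 16,777,216 cells
high = dense_voxel_count(1024)  # 1,073,741,824 cells, on the order of 10^9
growth = high // low            # 4x the edge length gives 64x the cells

print(f"256^3  = {low:,} cells")
print(f"1024^3 = {high:,} cells")
print(f"growth = {growth}x")

# Full attention cost grows quadratically: doubling the token count
# quadruples the number of query-key pairs that must be scored.
print(full_attention_pairs(2_000) // full_attention_pairs(1_000))
```

A dense 1024³ grid holds about a billion cells, which is why the abstract stresses sparse volumes and an attention mechanism that avoids scoring every token pair.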

