Direct3D-S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention
May 23, 2025
Authors: Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Yikang Yang, Yajie Bao, Jiachen Qian, Siyu Zhu, Philip Torr, Xun Cao, Yao Yao
cs.AI
Abstract
Generating high-resolution 3D shapes using volumetric representations such as Signed Distance Functions presents substantial computational and memory challenges. We introduce Direct3D-S2, a scalable 3D generation framework based on sparse volumes that achieves superior output quality with dramatically reduced training costs. Our key innovation is the Spatial Sparse Attention (SSA) mechanism, which greatly enhances the efficiency of Diffusion Transformer computations on sparse volumetric data. SSA allows the model to effectively process large token sets within sparse volumes, significantly reducing computational overhead and achieving a 3.9x speedup in the forward pass and a 9.6x speedup in the backward pass. Our framework also includes a variational autoencoder that maintains a consistent sparse volumetric format across the input, latent, and output stages. Compared to previous methods that use heterogeneous representations in their 3D VAEs, this unified design significantly improves training efficiency and stability. Our model is trained on publicly available datasets, and experiments demonstrate that Direct3D-S2 not only surpasses state-of-the-art methods in generation quality and efficiency, but also enables training at 1024 resolution using only 8 GPUs, a task that typically requires at least 32 GPUs for volumetric representations at 256 resolution, thus making gigascale 3D generation both practical and accessible. Project page: https://nju3dv.github.io/projects/Direct3D-S2/.
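
To make the core idea concrete, below is a minimal PyTorch sketch of spatially localized attention over sparse voxel tokens: attention is computed only among occupied voxels that fall into the same spatial block. This is an illustrative sketch, not the paper's SSA implementation; the block partitioning scheme, the block_size parameter, and the per-block dense attention are assumptions made for clarity.

import torch
import torch.nn.functional as F

def block_sparse_attention(feats, coords, block_size=8, num_heads=4):
    # feats:  (N, C) features of the N occupied voxels
    # coords: (N, 3) integer voxel coordinates of those voxels
    N, C = feats.shape
    head_dim = C // num_heads
    q = feats.view(N, num_heads, head_dim)
    k = feats.view(N, num_heads, head_dim)
    v = feats.view(N, num_heads, head_dim)

    # Assign each voxel to a spatial block by quantizing its coordinates,
    # then fold the 3D block index into a single integer key.
    block_ids = coords // block_size
    keys = block_ids[:, 0] * 1_000_003 + block_ids[:, 1] * 1_003 + block_ids[:, 2]

    out = torch.empty_like(feats)
    # Attend only among tokens that share a spatial block (dense attention per block).
    for key in keys.unique():
        idx = (keys == key).nonzero(as_tuple=True)[0]
        qi, ki, vi = q[idx], k[idx], v[idx]                      # (n, H, d)
        attn = F.scaled_dot_product_attention(
            qi.transpose(0, 1), ki.transpose(0, 1), vi.transpose(0, 1)
        )                                                        # (H, n, d)
        out[idx] = attn.transpose(0, 1).reshape(-1, C)
    return out

# Toy usage: 5,000 occupied voxels from a 64^3 grid with 64-dim features.
coords = torch.randint(0, 64, (5000, 3))
feats = torch.randn(5000, 64)
print(block_sparse_attention(feats, coords).shape)  # torch.Size([5000, 64])

In practice a Python loop over blocks would be far too slow; the sketch only illustrates the grouping logic, which an efficient implementation would realize with a fused kernel over variable-length blocks.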