Direct3D-S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention
May 23, 2025
Authors: Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Yikang Yang, Yajie Bao, Jiachen Qian, Siyu Zhu, Philip Torr, Xun Cao, Yao Yao
cs.AI
Abstract
Generating high-resolution 3D shapes using volumetric representations such as Signed Distance Functions presents substantial computational and memory challenges. We introduce Direct3D-S2, a scalable 3D generation framework based on sparse volumes that achieves superior output quality with dramatically reduced training costs. Our key innovation is the Spatial Sparse Attention (SSA) mechanism, which greatly enhances the efficiency of Diffusion Transformer computations on sparse volumetric data. SSA allows the model to effectively process large token sets within sparse volumes, significantly reducing computational overhead and achieving a 3.9x speedup in the forward pass and a 9.6x speedup in the backward pass. Our framework also includes a variational autoencoder that maintains a consistent sparse volumetric format across the input, latent, and output stages. Compared to previous methods that use heterogeneous representations in their 3D VAEs, this unified design significantly improves training efficiency and stability. Our model is trained on publicly available datasets, and experiments demonstrate that Direct3D-S2 not only surpasses state-of-the-art methods in generation quality and efficiency, but also enables training at 1024 resolution using only 8 GPUs, a task that typically requires at least 32 GPUs for volumetric representations at 256 resolution, thus making gigascale 3D generation both practical and accessible. Project page: https://nju3dv.github.io/projects/Direct3D-S2/.
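
To make the core idea concrete, below is a minimal PyTorch sketch of spatially localized attention over sparse voxel tokens: attention is computed only among occupied voxels that fall into the same spatial block. This is an illustrative sketch, not the paper's SSA implementation; the block partitioning scheme, the block_size parameter, and the per-block dense attention are assumptions made for clarity.

import torch
import torch.nn.functional as F

def block_sparse_attention(feats, coords, block_size=8, num_heads=4):
    # feats:  (N, C) features of the N occupied voxels
    # coords: (N, 3) integer voxel coordinates of those voxels
    N, C = feats.shape
    head_dim = C // num_heads
    q = feats.view(N, num_heads, head_dim)
    k = feats.view(N, num_heads, head_dim)
    v = feats.view(N, num_heads, head_dim)

    # Assign each voxel to a spatial block by quantizing its coordinates,
    # then fold the 3D block index into a single integer key.
    block_ids = coords // block_size
    keys = block_ids[:, 0] * 1_000_003 + block_ids[:, 1] * 1_003 + block_ids[:, 2]

    out = torch.empty_like(feats)
    # Attend only among tokens that share a spatial block (dense attention per block).
    for key in keys.unique():
        idx = (keys == key).nonzero(as_tuple=True)[0]
        qi, ki, vi = q[idx], k[idx], v[idx]                      # (n, H, d)
        attn = F.scaled_dot_product_attention(
            qi.transpose(0, 1), ki.transpose(0, 1), vi.transpose(0, 1)
        )                                                        # (H, n, d)
        out[idx] = attn.transpose(0, 1).reshape(-1, C)
    return out

# Toy usage: 5,000 occupied voxels from a 64^3 grid with 64-dim features.
coords = torch.randint(0, 64, (5000, 3))
feats = torch.randn(5000, 64)
print(block_sparse_attention(feats, coords).shape)  # torch.Size([5000, 64])

In practice a Python loop over blocks would be far too slow; the sketch only illustrates the grouping logic, which an efficient implementation would realize with a fused kernel over variable-length blocks.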