

Direct3D-S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention

May 23, 2025
Authors: Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Yikang Yang, Yajie Bao, Jiachen Qian, Siyu Zhu, Philip Torr, Xun Cao, Yao Yao
cs.AI

Abstract

Generating high-resolution 3D shapes using volumetric representations such as Signed Distance Functions presents substantial computational and memory challenges. We introduce Direct3D-S2, a scalable 3D generation framework based on sparse volumes that achieves superior output quality with dramatically reduced training costs. Our key innovation is the Spatial Sparse Attention (SSA) mechanism, which greatly enhances the efficiency of Diffusion Transformer computations on sparse volumetric data. SSA allows the model to effectively process large token sets within sparse volumes, significantly reducing computational overhead and achieving a 3.9x speedup in the forward pass and a 9.6x speedup in the backward pass. Our framework also includes a variational autoencoder that maintains a consistent sparse volumetric format across the input, latent, and output stages. Compared to previous 3D VAEs that use heterogeneous representations, this unified design significantly improves training efficiency and stability. Our model is trained on publicly available datasets, and experiments demonstrate that Direct3D-S2 not only surpasses state-of-the-art methods in generation quality and efficiency, but also enables training at 1024 resolution using only 8 GPUs, a task that typically requires at least 32 GPUs for volumetric representations at 256 resolution, thus making gigascale 3D generation both practical and accessible. Project page: https://nju3dv.github.io/projects/Direct3D-S2/.
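The abstract does not detail how SSA restricts attention, but the general idea behind spatial sparse attention schemes can be illustrated with a minimal sketch: group the occupied voxels of a sparse volume into coarse spatial blocks and compute attention only among tokens within the same block, so cost scales with the sum of squared block sizes rather than the square of the total token count. The function `spatial_block_attention` below is a hypothetical, single-head illustration (no learned projections), not the paper's actual SSA:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_block_attention(tokens, coords, block_size=2):
    """Attend only among tokens that share a coarse spatial block.

    tokens: (N, d) features of occupied voxels
    coords: (N, 3) integer voxel coordinates
    Restricting attention to each block reduces the O(N^2) cost of
    dense attention to roughly the sum of squared block populations.
    """
    out = np.zeros_like(tokens)
    block_ids = coords // block_size  # coarse block index per token
    # map each occupied block to the indices of the tokens it contains
    groups = {}
    for i, key in enumerate(map(tuple, block_ids)):
        groups.setdefault(key, []).append(i)
    for idx in groups.values():
        q = k = v = tokens[idx]  # projections omitted for brevity
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        out[idx] = attn @ v
    return out

rng = np.random.default_rng(0)
coords = rng.integers(0, 4, size=(16, 3))   # 16 occupied voxels in a 4^3 grid
tokens = rng.standard_normal((16, 8))
out = spatial_block_attention(tokens, coords)
print(out.shape)
```

Only tokens in the same occupied block interact, so empty regions of the volume contribute no computation at all, which is the source of the efficiency gains the abstract reports.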

