Direct3D-S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention

Abstract

Generating high-resolution 3D shapes using volumetric representations such as Signed Distance Functions (SDFs) presents substantial computational and memory challenges. We introduce Direct3D-S2, a scalable 3D generation framework based on sparse volumes that achieves superior output quality with dramatically reduced training costs. Our key innovation is the Spatial Sparse Attention (SSA) mechanism, which greatly enhances the efficiency of Diffusion Transformer (DiT) computations on sparse volumetric data. SSA allows the model to effectively process large token sets within sparse volumes, significantly reducing computational overhead and achieving a 3.9x speedup in the forward pass and a 9.6x speedup in the backward pass. Our framework also includes a variational autoencoder (VAE) that maintains a consistent sparse volumetric format across the input, latent, and output stages. Compared to previous methods that use heterogeneous representations in their 3D VAEs, this unified design significantly improves training efficiency and stability. Our model is trained on publicly available datasets, and experiments demonstrate that Direct3D-S2 not only surpasses state-of-the-art methods in generation quality and efficiency, but also enables training at 1024³ resolution using only 8 GPUs, a task that typically requires at least 32 GPUs for volumetric representations at 256³ resolution, thus making gigascale 3D generation both practical and accessible. Project page: https://nju3dv.github.io/projects/Direct3D-S2/.
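To put the "gigascale" label in perspective, the short sketch below counts the cells in a dense cubic grid at the two resolutions named in the abstract. It assumes those resolutions refer to cubic grids (256 and 1024 cells per side). The function name is illustrative only, and the counts describe a dense grid, not the sparse volumes the framework actually operates on.

```python
def dense_voxel_count(resolution: int) -> int:
    """Number of cells in a dense cubic grid with `resolution` cells per side."""
    return resolution ** 3


# Dense grid sizes at the two resolutions mentioned in the abstract.
low = dense_voxel_count(256)    # 16,777,216 cells
high = dense_voxel_count(1024)  # 1,073,741,824 cells, just over one billion

# Quadrupling the side length multiplies the dense cell count by 4**3 = 64.
growth = high // low

print(f"256 per side:  {low:,} cells")
print(f"1024 per side: {high:,} cells")
print(f"growth factor: {growth}x")
```

A dense grid at the higher resolution holds 64 times as many cells as at the lower one, which illustrates why a representation that stores only occupied regions, together with attention restricted to those regions, matters at this scale.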
