Direct3D-S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention

Abstract

Generating high-resolution 3D shapes using volumetric representations such as Signed Distance Functions (SDFs) presents substantial computational and memory challenges. We introduce Direct3D-S2, a scalable 3D generation framework based on sparse volumes that achieves superior output quality with dramatically reduced training costs. Our key innovation is the Spatial Sparse Attention (SSA) mechanism, which greatly enhances the efficiency of Diffusion Transformer computations on sparse volumetric data. SSA allows the model to effectively process large token sets within sparse volumes, significantly reducing computational overhead and achieving a 3.9x speedup in the forward pass and a 9.6x speedup in the backward pass. Our framework also includes a variational autoencoder (VAE) that maintains a consistent sparse volumetric format across the input, latent, and output stages. Compared to previous 3D VAE methods that use heterogeneous representations, this unified design significantly improves training efficiency and stability. Our model is trained on publicly available datasets, and experiments demonstrate that Direct3D-S2 not only surpasses state-of-the-art methods in generation quality and efficiency, but also enables training at 1024 resolution using only 8 GPUs, whereas volumetric representations typically require at least 32 GPUs at 256 resolution. This makes gigascale 3D generation both practical and accessible. Project page: https://nju3dv.github.io/projects/Direct3D-S2/.
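
To make the scale in the abstract concrete, the following minimal sketch counts the cells in a dense cubic grid at the two resolutions mentioned, then builds a toy sparse SDF volume that stores only voxels near a surface. It assumes that "resolution" means the number of cells per side of a cubic grid. The sphere shape, the narrow-band threshold, and the helper names `dense_voxel_count` and `sparse_sphere_sdf` are illustrative assumptions, not part of Direct3D-S2.

```python
import math


def dense_voxel_count(resolution: int) -> int:
    """Number of cells in a dense cubic grid with `resolution` cells per side."""
    return resolution ** 3


def sparse_sphere_sdf(resolution: int, radius: float = 0.5, band: float = 2.0) -> dict:
    """Toy sparse SDF volume for a sphere centred in the grid [-1, 1]^3.

    Hypothetical helper: only voxels whose signed distance lies within
    `band` voxel widths of the surface are stored, keyed by (i, j, k).
    """
    voxel = 2.0 / resolution
    # Voxel-centre coordinates along one axis.
    centres = [-1.0 + (n + 0.5) * voxel for n in range(resolution)]
    sparse = {}
    for i, x in enumerate(centres):
        for j, y in enumerate(centres):
            for k, z in enumerate(centres):
                sdf = math.sqrt(x * x + y * y + z * z) - radius
                if abs(sdf) <= band * voxel:
                    sparse[(i, j, k)] = sdf
    return sparse


dense_256 = dense_voxel_count(256)
dense_1024 = dense_voxel_count(1024)
print(f"dense cells at 256:  {dense_256:,}")
print(f"dense cells at 1024: {dense_1024:,}")
print(f"growth factor:       {dense_1024 // dense_256}x")

# A small grid keeps the pure-Python triple loop fast.
toy_resolution = 32
volume = sparse_sphere_sdf(toy_resolution)
occupancy = len(volume) / dense_voxel_count(toy_resolution)
print(f"stored voxels at {toy_resolution}: {len(volume):,} ({occupancy:.1%} of the dense grid)")
```

Under this assumption, a dense grid at 1024 resolution holds 1,073,741,824 cells, 64 times as many as at 256 resolution, which is where the "gigascale" label comes from. In this toy setup the stored fraction shrinks as resolution grows, because a fixed-width band around a surface grows roughly with the square of the resolution while the dense grid grows with its cube.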
