Fast Training of Diffusion Transformer with Extreme Masking for 3D Point Clouds Generation

Abstract

Diffusion Transformers have recently shown remarkable effectiveness in generating high-quality 3D point clouds. However, training voxel-based diffusion models for high-resolution 3D voxels remains prohibitively expensive due to the cubic complexity of attention operators, which arises from the additional dimension of voxels. Motivated by the inherent redundancy of 3D compared to 2D, we propose FastDiT-3D, a novel masked diffusion transformer tailored for efficient 3D point cloud generation, which greatly reduces training costs. Specifically, we draw inspiration from masked autoencoders to dynamically operate the denoising process on masked voxelized point clouds. We also propose a novel voxel-aware masking strategy to adaptively aggregate background and foreground information from voxelized point clouds. Our method achieves state-of-the-art performance with an extreme masking ratio of nearly 99%. Moreover, to improve multi-category 3D generation, we introduce Mixture-of-Experts (MoE) into the 3D diffusion model. Each category can learn a distinct diffusion path with different experts, which relieves gradient conflict. Experimental results on the ShapeNet dataset demonstrate that our method achieves state-of-the-art performance in high-fidelity and diverse 3D point cloud generation. When generating 128-resolution voxel point clouds, FastDiT-3D improves the 1-Nearest Neighbor Accuracy and Coverage metrics while using only 6.5% of the original training cost.
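The voxel-aware masking idea can be illustrated with a small sketch. The abstract does not say how the masking is computed, so everything below is an assumption made for illustration: the function name `voxel_aware_mask`, the use of separate masking ratios for foreground (occupied) and background (empty) patch tokens, the ratio values, and the 16 x 16 x 16 token grid are all hypothetical and are not taken from the paper. As in masked autoencoders, only the returned visible tokens would be passed to the encoder.

```python
import random


def voxel_aware_mask(occupied, fg_mask_ratio=0.95, bg_mask_ratio=0.99, seed=0):
    """Pick the patch tokens that stay visible during training.

    occupied: list of bools, one per patch token; True marks a foreground
    token (its patch contains at least one occupied voxel).
    The two ratios are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    fg = [i for i, is_fg in enumerate(occupied) if is_fg]
    bg = [i for i, is_fg in enumerate(occupied) if not is_fg]

    # Keep at least one token from every non-empty group so that neither
    # foreground nor background information is dropped entirely.
    keep_fg = max(1, round(len(fg) * (1 - fg_mask_ratio))) if fg else 0
    keep_bg = max(1, round(len(bg) * (1 - bg_mask_ratio))) if bg else 0

    visible = rng.sample(fg, keep_fg) + rng.sample(bg, keep_bg)
    return sorted(visible)


# Toy input: a hypothetical 16 x 16 x 16 token grid (4096 tokens) in which
# every 20th token is foreground, mimicking a sparse voxelized point cloud.
num_tokens = 16 ** 3
occupied = [i % 20 == 0 for i in range(num_tokens)]

visible = voxel_aware_mask(occupied)
masked_ratio = 1 - len(visible) / num_tokens
print(f"visible tokens: {len(visible)} of {num_tokens}")
print(f"overall masking ratio: {masked_ratio:.3f}")
```

With these assumed ratios the overall masking ratio lands just under 99%, and the foreground is masked less aggressively than the empty background, so the few surviving tokens still cover the object.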
