Fast Training of Diffusion Transformer with Extreme Masking for 3D Point Clouds Generation
December 12, 2023
Authors: Shentong Mo, Enze Xie, Yue Wu, Junsong Chen, Matthias Nießner, Zhenguo Li
cs.AI
Abstract
Diffusion Transformers have recently shown remarkable effectiveness in
generating high-quality 3D point clouds. However, training voxel-based
diffusion models for high-resolution 3D voxels remains prohibitively expensive
due to the cubic complexity of attention operators, which arises from the
additional dimension of voxels. Motivated by the inherent redundancy of 3D
compared to 2D, we propose FastDiT-3D, a novel masked diffusion transformer
tailored for efficient 3D point cloud generation, which greatly reduces
training costs. Specifically, we draw inspiration from masked autoencoders to
dynamically operate the denoising process on masked voxelized point clouds. We
also propose a novel voxel-aware masking strategy to adaptively aggregate
background/foreground information from voxelized point clouds. Our method
achieves state-of-the-art performance with an extreme masking ratio of nearly
99%. Moreover, to improve multi-category 3D generation, we introduce
Mixture-of-Experts (MoE) into the 3D diffusion model. Each category can learn a
distinct diffusion path with different experts, alleviating gradient conflicts.
Experimental results on the ShapeNet dataset demonstrate that our method
achieves state-of-the-art high-fidelity and diverse 3D point cloud generation
performance. Our FastDiT-3D improves 1-Nearest Neighbor Accuracy and Coverage
metrics when generating 128-resolution voxel point clouds, using only 6.5% of
the original training cost.
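
The voxel-aware masking strategy can be pictured with a short sketch. The following is a minimal illustration, not the paper's released code: the helper name `voxel_aware_mask`, the per-patch `occupancy` input, and the foreground bias term `fg_bias` are assumptions inferred from the abstract; only the roughly 99% extreme masking ratio comes from the text.

```python
# Minimal sketch (assumed, not the authors' implementation) of voxel-aware
# masking at an extreme ratio: keep ~1% of voxel-patch tokens, biased toward
# foreground (occupied) voxels so shape information survives the masking.
import torch

def voxel_aware_mask(tokens, occupancy, mask_ratio=0.99, fg_bias=1.0):
    """tokens: (B, N, D) voxel-patch embeddings.
    occupancy: (B, N) bool, True where a patch contains foreground points."""
    B, N, D = tokens.shape
    n_keep = max(1, int(N * (1.0 - mask_ratio)))      # ~1% of tokens survive
    # Random scores, shifted up for occupied voxels so the top-k selection
    # preferentially keeps foreground patches over empty background.
    scores = torch.rand(B, N, device=tokens.device)
    scores[occupancy] += fg_bias
    keep_idx = scores.topk(n_keep, dim=1).indices     # (B, n_keep)
    kept = torch.gather(tokens, 1,
                        keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return kept, keep_idx                             # denoise only `kept`
```

The denoiser then runs attention only over the ~1% of kept tokens, which is what makes the reported training-cost reduction plausible given the cubic attention cost on full voxel grids.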
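
Likewise, the Mixture-of-Experts idea, routing tokens so that each shape category can follow its own diffusion path, can be sketched as an MoE feed-forward layer inside a transformer block. The expert count, top-1 token routing, and layer placement below are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (assumed) of an MoE feed-forward layer: a router assigns each
# token to one expert, so different categories can use different experts and
# reduce gradient conflicts during multi-category training.
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    def __init__(self, dim, hidden, n_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                          nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):                              # x: (B, N, D)
        gates = self.router(x).softmax(dim=-1)         # (B, N, E)
        top_gate, top_idx = gates.max(dim=-1)          # top-1 routing
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = top_idx == e                         # tokens sent to expert e
            if sel.any():
                out[sel] = top_gate[sel].unsqueeze(-1) * expert(x[sel])
        return out
```

Under this kind of top-1 routing, tokens from different categories tend to specialize onto different experts, which is one plausible reading of "a distinct diffusion path with different experts" in the abstract.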