Fast Training of Diffusion Transformer with Extreme Masking for 3D Point Clouds Generation
December 12, 2023
Authors: Shentong Mo, Enze Xie, Yue Wu, Junsong Chen, Matthias Nießner, Zhenguo Li
cs.AI
Abstract
Diffusion Transformers have recently shown remarkable effectiveness in
generating high-quality 3D point clouds. However, training voxel-based
diffusion models for high-resolution 3D voxels remains prohibitively expensive
due to the cubic complexity of attention operators, which arises from the
additional dimension of voxels. Motivated by the inherent redundancy of 3D
compared to 2D, we propose FastDiT-3D, a novel masked diffusion transformer
tailored for efficient 3D point cloud generation, which greatly reduces
training costs. Specifically, we draw inspiration from masked autoencoders to
dynamically operate the denoising process on masked voxelized point clouds. We
also propose a novel voxel-aware masking strategy to adaptively aggregate
background/foreground information from voxelized point clouds. Our method
achieves state-of-the-art performance with an extreme masking ratio of nearly
99%. Moreover, to improve multi-category 3D generation, we introduce a
Mixture-of-Experts (MoE) design into the 3D diffusion model. Each category can
learn a distinct diffusion path with its own experts, alleviating gradient
conflicts.
Experimental results on the ShapeNet dataset demonstrate that our method
achieves state-of-the-art high-fidelity and diverse 3D point cloud generation
performance. Our FastDiT-3D improves 1-Nearest Neighbor Accuracy and Coverage
metrics when generating 128-resolution voxel point clouds, using only 6.5% of
the original training cost.
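The voxel-aware masking strategy described above can be illustrated with a minimal sketch. The paper's actual masking procedure is not specified here; the function below, its name, and the keep ratios (`fg_keep`, `bg_keep`) are all hypothetical, chosen only to show the core idea: occupied (foreground) voxels are kept at a higher rate than empty (background) ones, while the overall masking ratio stays near 99%.

```python
import numpy as np

def voxel_aware_mask(voxels, fg_keep=0.05, bg_keep=0.001, seed=0):
    """Sketch of voxel-aware masking: keep a larger fraction of occupied
    (foreground) voxels than empty (background) ones, so the visible
    tokens concentrate on the object's surface. Ratios are illustrative."""
    rng = np.random.default_rng(seed)
    flat = voxels.reshape(-1)
    fg_idx = np.flatnonzero(flat > 0)   # occupied voxels
    bg_idx = np.flatnonzero(flat == 0)  # empty voxels
    keep_fg = rng.choice(fg_idx, size=max(1, int(len(fg_idx) * fg_keep)),
                         replace=False)
    keep_bg = rng.choice(bg_idx, size=max(1, int(len(bg_idx) * bg_keep)),
                         replace=False)
    mask = np.ones(flat.shape, dtype=bool)  # True = masked (dropped) token
    mask[np.concatenate([keep_fg, keep_bg])] = False
    return mask.reshape(voxels.shape)

# Toy 16^3 occupancy grid with a small cube of occupied voxels inside.
grid = np.zeros((16, 16, 16), dtype=np.float32)
grid[4:8, 4:8, 4:8] = 1.0
mask = voxel_aware_mask(grid)
print(f"masking ratio: {mask.mean():.3f}")  # close to the ~99% regime
```

Because the denoising transformer only processes the unmasked tokens, a ~99% masking ratio shrinks the attention sequence length by roughly 100x, which is where the cubic-cost savings come from.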
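The Mixture-of-Experts idea, where each category follows a distinct diffusion path, can likewise be sketched. The paper does not detail its routing mechanism here; the hard per-category routing below, the function `moe_ffn`, and all dimensions are assumptions made purely for illustration, replacing the transformer's feed-forward layer with one expert MLP per shape category.

```python
import numpy as np

def moe_ffn(x, category_id, experts):
    """Sketch: route all tokens of one sample to a per-category expert MLP,
    so different categories take different parameter paths and their
    gradients do not interfere. Expert = (W1, b1, W2, b2); shapes illustrative."""
    W1, b1, W2, b2 = experts[category_id]
    h = np.maximum(x @ W1 + b1, 0.0)  # ReLU hidden layer
    return h @ W2 + b2

rng = np.random.default_rng(0)
d, hdim, n_experts = 8, 16, 3  # toy sizes, not the paper's
experts = [(rng.normal(size=(d, hdim)), np.zeros(hdim),
            rng.normal(size=(hdim, d)), np.zeros(d))
           for _ in range(n_experts)]
tokens = rng.normal(size=(5, d))  # 5 visible voxel tokens of one sample
out = moe_ffn(tokens, category_id=1, experts=experts)
print(out.shape)
```

With separate expert parameters, updates for one category (e.g. chairs) do not pull on the weights another category (e.g. airplanes) relies on, which is the gradient-conflict relief the abstract refers to.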