Fast Training of Diffusion Transformer with Extreme Masking for 3D Point Clouds Generation

Abstract

Diffusion Transformers have recently shown remarkable effectiveness in generating high-quality 3D point clouds. However, training voxel-based diffusion models on high-resolution 3D voxels remains prohibitively expensive because of the cubic complexity of attention operators, which arises from the additional dimension of voxels. Motivated by the inherent redundancy of 3D data compared with 2D, we propose FastDiT-3D, a novel masked diffusion transformer tailored for efficient 3D point cloud generation, which greatly reduces training costs. Specifically, we draw inspiration from masked autoencoders to dynamically operate the denoising process on masked voxelized point clouds. We also propose a novel voxel-aware masking strategy to adaptively aggregate background/foreground information from voxelized point clouds. Our method achieves state-of-the-art performance with an extreme masking ratio of nearly 99%. Moreover, to improve multi-category 3D generation, we introduce Mixture-of-Experts (MoE) into the 3D diffusion model. Each category can learn a distinct diffusion path with different experts, which relieves gradient conflicts. Experimental results on the ShapeNet dataset demonstrate that our method achieves state-of-the-art high-fidelity and diverse 3D point cloud generation performance. When generating 128-resolution voxel point clouds, FastDiT-3D improves the 1-Nearest Neighbor Accuracy and Coverage metrics while using only 6.5% of the original training cost.
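To make the idea of masking voxelized point clouds more concrete, the following is a minimal sketch, not the authors' implementation. It splits a voxel grid into cubic patches (tokens), labels a patch as foreground if any point falls inside it, and masks foreground and background patches at different ratios. The function names, the patch size of 4, and the two ratios are assumptions chosen purely for illustration; the abstract states only the overall masking ratio of nearly 99%.

```python
import random


def patch_tokens(resolution: int, patch_size: int) -> int:
    """Count non-overlapping cubic patches (tokens) in a resolution^3 voxel grid."""
    per_axis = resolution // patch_size
    return per_axis ** 3


def voxel_aware_mask(points, resolution, patch_size, fg_ratio, bg_ratio, seed=0):
    """Pick the patches that stay visible after masking (illustrative only).

    points   : iterable of (x, y, z) coordinates in [0, 1]
    fg_ratio : fraction of occupied (foreground) patches to mask (hypothetical)
    bg_ratio : fraction of empty (background) patches to mask (hypothetical)
    Returns (kept_patches, foreground_patches, total_patch_count).
    """
    rng = random.Random(seed)
    per_axis = resolution // patch_size

    # A patch is foreground if at least one point lands in one of its voxels.
    foreground = set()
    for point in points:
        voxel = [min(int(c * resolution), resolution - 1) for c in point]
        foreground.add(tuple(v // patch_size for v in voxel))

    all_patches = [
        (i, j, k)
        for i in range(per_axis)
        for j in range(per_axis)
        for k in range(per_axis)
    ]
    background = [p for p in all_patches if p not in foreground]
    fg = sorted(foreground)

    # Keep the unmasked share of each group; always keep one foreground patch.
    keep_fg = rng.sample(fg, max(1, round(len(fg) * (1 - fg_ratio)))) if fg else []
    keep_bg = rng.sample(background, round(len(background) * (1 - bg_ratio)))
    return keep_fg + keep_bg, fg, len(all_patches)


if __name__ == "__main__":
    # Token count at the 128 resolution mentioned above, with an assumed patch size of 4.
    print("tokens at 128^3:", patch_tokens(128, 4))

    # Small toy example: points clustered near the centre of a 32^3 grid.
    rng = random.Random(0)
    pts = [tuple(rng.uniform(0.4, 0.6) for _ in range(3)) for _ in range(2048)]
    kept, fg, total = voxel_aware_mask(pts, 32, 4, fg_ratio=0.5, bg_ratio=0.99)
    print("foreground patches:", len(fg), "of", total)
    print("overall masking ratio:", 1 - len(kept) / total)
```

Because most patches of a sparse point cloud are empty, masking the background heavily pushes the overall ratio very high while still keeping a meaningful share of occupied patches. Since pairwise self-attention cost grows quadratically with the number of tokens, processing only the visible tokens is what makes a reduction in training cost plausible in a setup like this.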
