DiT-3D: Exploring Plain Diffusion Transformers for 3D Shape Generation

July 4, 2023
Authors: Shentong Mo, Enze Xie, Ruihang Chu, Lewei Yao, Lanqing Hong, Matthias Nießner, Zhenguo Li
cs.AI

Abstract

Recent Diffusion Transformers (e.g., DiT) have demonstrated their powerful effectiveness in generating high-quality 2D images. However, it remains unclear whether the Transformer architecture performs equally well in 3D shape generation, as previous 3D diffusion methods mostly adopted the U-Net architecture. To bridge this gap, we propose a novel Diffusion Transformer for 3D shape generation, namely DiT-3D, which can directly operate the denoising process on voxelized point clouds using plain Transformers. Compared to existing U-Net approaches, our DiT-3D is more scalable in model size and produces generations of much higher quality. Specifically, DiT-3D adopts the design philosophy of DiT but modifies it by incorporating 3D positional and patch embeddings to adaptively aggregate input from voxelized point clouds. To reduce the computational cost of self-attention in 3D shape generation, we incorporate 3D window attention into Transformer blocks, as the increased 3D token length resulting from the additional dimension of voxels can lead to high computational cost. Finally, linear and devoxelization layers are used to predict the denoised point clouds. In addition, our Transformer architecture supports efficient fine-tuning from 2D to 3D, where a DiT-2D checkpoint pre-trained on ImageNet can significantly improve DiT-3D on ShapeNet. Experimental results on the ShapeNet dataset demonstrate that the proposed DiT-3D achieves state-of-the-art performance in high-fidelity and diverse 3D point cloud generation. In particular, our DiT-3D decreases the 1-Nearest Neighbor Accuracy of the state-of-the-art method by 4.59 and increases the Coverage metric by 3.51 when evaluated on Chamfer Distance.
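
To make the voxel-to-token pipeline in the abstract concrete, below is a minimal PyTorch sketch of the stages it describes: a 3D patch embedding with a learnable 3D positional embedding, a 3D window-attention step, and a linear head. All names (`PatchEmbed3D`, `window_attention_3d`) and hyperparameters (voxel resolution 32, patch size 4, width 384, window size 2) are illustrative assumptions, not the authors' released code, and the linear head only stands in for the paper's linear-and-devoxelization output layers.

```python
# A sketch of the DiT-3D-style token path described in the abstract:
# voxel grid -> 3D patch embedding (+3D positional embedding)
# -> 3D window attention -> linear head. Names and sizes are assumptions.
import torch
import torch.nn as nn

class PatchEmbed3D(nn.Module):
    """Split a voxel grid (B, C, V, V, V) into non-overlapping 3D patches,
    project each patch to a token, and add a 3D positional embedding."""
    def __init__(self, voxel_size=32, patch_size=4, in_ch=3, dim=384):
        super().__init__()
        self.proj = nn.Conv3d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        n = voxel_size // patch_size                      # patches per axis
        self.pos_embed = nn.Parameter(torch.zeros(1, n ** 3, dim))

    def forward(self, x):                                 # x: (B, C, V, V, V)
        x = self.proj(x)                                  # (B, dim, n, n, n)
        x = x.flatten(2).transpose(1, 2)                  # (B, n^3, dim)
        return x + self.pos_embed

def window_attention_3d(tokens, attn, grid, window):
    """Run attention inside local (window ** 3)-token 3D windows instead of
    over all grid ** 3 tokens, reducing the quadratic attention cost."""
    B, N, D = tokens.shape                                # N == grid ** 3
    g, w = grid, window
    # Split each spatial axis into (g // w) windows of w tokens each.
    x = tokens.reshape(B, g // w, w, g // w, w, g // w, w, D)
    x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, w ** 3, D)
    x, _ = attn(x, x, x)                                  # per-window attention
    # Undo the window partition to recover the original token order.
    x = x.reshape(B, g // w, g // w, g // w, w, w, w, D)
    x = x.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, N, D)
    return x

# Usage: embed a batch of voxelized point clouds, apply one windowed
# attention step, then a linear head standing in for the denoising output.
embed = PatchEmbed3D(voxel_size=32, patch_size=4, in_ch=3, dim=384)
attn = nn.MultiheadAttention(embed_dim=384, num_heads=6, batch_first=True)
head = nn.Linear(384, 3)                                  # stand-in output layer

voxels = torch.randn(2, 3, 32, 32, 32)                    # voxelized point clouds
tokens = embed(voxels)                                    # (2, 512, 384)
tokens = window_attention_3d(tokens, attn, grid=8, window=2)
print(head(tokens).shape)                                 # torch.Size([2, 512, 3])
```

The windowing step matters because token count grows cubically with voxel resolution: even a 32^3 grid with patch size 4 yields 8^3 = 512 tokens, and full self-attention scales with the square of that count, whereas restricting attention to 2^3-token windows keeps the per-layer cost near-linear in the number of tokens.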