

DiT-3D: Exploring Plain Diffusion Transformers for 3D Shape Generation

July 4, 2023
作者: Shentong Mo, Enze Xie, Ruihang Chu, Lewei Yao, Lanqing Hong, Matthias Nießner, Zhenguo Li
cs.AI

Abstract

Recent Diffusion Transformers (e.g., DiT) have demonstrated their powerful effectiveness in generating high-quality 2D images. However, it remains unclear whether the Transformer architecture performs equally well in 3D shape generation, as previous 3D diffusion methods mostly adopted the U-Net architecture. To bridge this gap, we propose a novel Diffusion Transformer for 3D shape generation, namely DiT-3D, which can directly operate the denoising process on voxelized point clouds using plain Transformers. Compared to existing U-Net approaches, our DiT-3D is more scalable in model size and produces generations of much higher quality. Specifically, DiT-3D adopts the design philosophy of DiT but modifies it by incorporating 3D positional and patch embeddings to adaptively aggregate input from voxelized point clouds. To reduce the computational cost of self-attention in 3D shape generation, we incorporate 3D window attention into the Transformer blocks, as the increased 3D token length resulting from the additional voxel dimension can lead to high computational cost. Finally, linear and devoxelization layers are used to predict the denoised point clouds. In addition, our Transformer architecture supports efficient fine-tuning from 2D to 3D, where a DiT-2D checkpoint pre-trained on ImageNet can significantly improve DiT-3D on ShapeNet. Experimental results on the ShapeNet dataset demonstrate that the proposed DiT-3D achieves state-of-the-art performance in high-fidelity and diverse 3D point cloud generation. In particular, our DiT-3D decreases the 1-Nearest Neighbor Accuracy of the state-of-the-art method by 4.59 and increases the Coverage metric by 3.51 when evaluated on Chamfer Distance.
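The pipeline the abstract describes (voxelize the point cloud, split the grid into 3D patches, run Transformer blocks with 3D window attention, then devoxelize) can be sketched roughly as follows. This is an illustrative NumPy sketch with made-up sizes and toy attention (query, key, and value all equal the input tokens); it is not the authors' implementation.

```python
import numpy as np

def patchify_3d(vox, p):
    """Split a (V, V, V, C) voxel grid into non-overlapping p^3 patches,
    flattened to one token of dim p*p*p*C per patch."""
    V, _, _, C = vox.shape
    n = V // p                                    # patch tokens per axis
    x = vox.reshape(n, p, n, p, n, p, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)          # (n, n, n, p, p, p, C)
    return x.reshape(n * n * n, p * p * p * C)    # (N, D) tokens

def unpatchify_3d(tokens, V, p, C):
    """Inverse of patchify_3d: tokens (N, p^3*C) -> voxel grid (V, V, V, C)."""
    n = V // p
    x = tokens.reshape(n, n, n, p, p, p, C)
    x = x.transpose(0, 3, 1, 4, 2, 5, 6)
    return x.reshape(V, V, V, C)

def window_attention_3d(tokens, n, w):
    """Toy single-head self-attention restricted to w^3 windows of the
    (n, n, n) token grid, cutting cost from O(N^2) to O(N * w^3)."""
    D = tokens.shape[-1]
    x = tokens.reshape(n, n, n, D)
    m = n // w
    # group tokens into m^3 windows of w^3 tokens each
    xw = x.reshape(m, w, m, w, m, w, D).transpose(0, 2, 4, 1, 3, 5, 6)
    xw = xw.reshape(m * m * m, w * w * w, D)
    # softmax attention within each window (q = k = v = x in this toy)
    scores = xw @ xw.transpose(0, 2, 1) / np.sqrt(D)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    out = attn @ xw
    # scatter the windows back onto the (n, n, n) grid
    out = out.reshape(m, m, m, w, w, w, D).transpose(0, 3, 1, 4, 2, 5, 6)
    return out.reshape(n * n * n, D)

# made-up sizes: a 32^3 single-channel grid, 4^3 patches, 2^3 windows
V, C, p, w = 32, 1, 4, 2
n = V // p                              # 8 tokens per axis -> N = 512 tokens
vox = np.random.rand(V, V, V, C)
tokens = patchify_3d(vox, p)            # (512, 64)
tokens = window_attention_3d(tokens, n, w)
recon = unpatchify_3d(tokens, V, p, C)  # back to (32, 32, 32, 1)
```

In the actual model the tokens would additionally receive learned 3D positional embeddings and pass through learned projections inside each block; the sketch only shows the patch/window bookkeeping that distinguishes the 3D variant from 2D DiT.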