

MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model

March 27, 2026
作者: Quan Dao, Dimitris Metaxas
cs.AI

Abstract

Transformer architectures, particularly Diffusion Transformers (DiTs), have become widely used in diffusion and flow-matching models due to their strong performance compared to convolutional UNets. However, the isotropic design of DiTs processes the same number of patchified tokens in every block, leading to relatively heavy computation during training. In this work, we introduce a multi-patch transformer design in which early blocks operate on larger patches to capture coarse global context, while later blocks use smaller patches to refine local details. This hierarchical design reduces computational cost by up to 50% in GFLOPs while maintaining strong generative performance. In addition, we propose improved designs for time and class embeddings that accelerate training convergence. Extensive experiments on the ImageNet dataset demonstrate the effectiveness of our architectural choices. Code is released at https://github.com/quandao10/MPDiT.
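The claimed FLOPs savings follow directly from how patch size sets the token count: a square latent of side H patchified with patch size p yields (H/p)² tokens, and self-attention cost per block grows roughly as N²·d. The back-of-envelope sketch below (not the paper's code; the latent size, width, depth, and the 6+6 patch schedule are illustrative assumptions) shows how running the early blocks on coarser patches cuts total attention cost.

```python
# Hypothetical back-of-envelope comparison (not the authors' implementation):
# token count and per-block self-attention cost for an isotropic DiT
# versus a multi-patch schedule (large patches early, small patches late).

def num_tokens(image_size: int, patch_size: int) -> int:
    """Tokens after patchifying a square latent of side `image_size`."""
    side = image_size // patch_size
    return side * side

def attn_flops(n_tokens: int, dim: int) -> int:
    """Rough self-attention cost per block: O(N^2 * d)."""
    return n_tokens * n_tokens * dim

image, dim, depth = 32, 768, 12   # assumed: 32x32 latent, width 768, 12 blocks

# Isotropic DiT: every block sees patch size 2 -> 256 tokens per block.
iso = depth * attn_flops(num_tokens(image, 2), dim)

# Multi-patch schedule (illustrative): 6 coarse blocks at patch size 4
# (64 tokens), then 6 fine blocks at patch size 2 (256 tokens).
multi = 6 * attn_flops(num_tokens(image, 4), dim) \
      + 6 * attn_flops(num_tokens(image, 2), dim)

print(f"isotropic attention cost:   {iso / 1e9:.2f} GFLOPs")
print(f"multi-patch attention cost: {multi / 1e9:.2f} GFLOPs")
print(f"savings: {1 - multi / iso:.0%}")   # -> 47% under these assumptions
```

With this particular 6+6 split the attention cost drops by roughly 47%, consistent in magnitude with the up-to-50% reduction the abstract reports; the exact figure depends on the patch schedule and on the non-attention terms (MLPs, embeddings) this sketch ignores.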