MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model

Abstract

Transformer architectures, particularly Diffusion Transformers (DiTs), have become widely used in diffusion and flow-matching models due to their strong performance compared to convolutional U-Nets. However, the isotropic design of DiTs processes the same number of patchified tokens in every block, which makes training relatively expensive in compute. In this work, we introduce a multi-patch transformer design in which early blocks operate on larger patches to capture coarse global context, while later blocks use smaller patches to refine local details. This hierarchical design reduces computational cost by up to 50% in GFLOPs while achieving good generative performance. In addition, we propose improved designs for the time and class embeddings that accelerate training convergence. Extensive experiments on the ImageNet dataset demonstrate the effectiveness of our architectural choices. Code is released at https://github.com/quandao10/MPDiT
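To make the token-count argument concrete, the following is a rough sketch of how patch size drives per-block cost. It is not the authors' implementation: the latent size, hidden dimension, depth, and the split between large-patch and small-patch blocks are all hypothetical values chosen for illustration, and the cost formula is a generic estimate for a standard transformer block that ignores embeddings, normalization, and any module that converts between patch resolutions. It is not intended to reproduce the reported 50% figure.

```python
def num_tokens(latent_size: int, patch_size: int) -> int:
    """Number of tokens when a square latent is split into square patches."""
    if latent_size % patch_size != 0:
        raise ValueError("latent_size must be divisible by patch_size")
    return (latent_size // patch_size) ** 2


def block_cost(n_tokens: int, dim: int, mlp_ratio: int = 4) -> int:
    """Rough multiply-accumulate count for one standard transformer block."""
    projections = 4 * n_tokens * dim * dim      # Q, K, V and output projections
    attention = 2 * n_tokens * n_tokens * dim   # QK^T scores plus weighted sum over V
    mlp = 2 * mlp_ratio * n_tokens * dim * dim  # two linear layers of the MLP
    return projections + attention + mlp


def model_cost(latent_size: int, patch_sizes: list, dim: int) -> int:
    """Sum block costs, with one patch size given per block."""
    return sum(block_cost(num_tokens(latent_size, p), dim) for p in patch_sizes)


# Hypothetical settings, chosen only for illustration.
LATENT, DIM, DEPTH = 32, 1152, 28
isotropic = [2] * DEPTH            # every block sees the same number of tokens
multi_patch = [4] * 20 + [2] * 8   # assumed split: coarse early blocks, fine late blocks

ratio = model_cost(LATENT, multi_patch, DIM) / model_cost(LATENT, isotropic, DIM)
print(f"tokens at patch 2: {num_tokens(LATENT, 2)}, at patch 4: {num_tokens(LATENT, 4)}")
print(f"block cost, multi-patch relative to isotropic: {ratio:.2f}")
```

Doubling the patch size cuts the token count by a factor of four, so every block that runs on the larger patches costs a fraction of an isotropic block. The overall saving then depends on how many blocks are assigned to each patch size.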
