
FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute

February 27, 2025
作者: Sotiris Anagnostidis, Gregor Bachmann, Yeongmin Kim, Jonas Kohler, Markos Georgopoulos, Artsiom Sanakoyeu, Yuming Du, Albert Pumarola, Ali Thabet, Edgar Schönfeld
cs.AI

Abstract

Despite their remarkable performance, modern Diffusion Transformers are hindered by substantial resource requirements during inference, stemming from the fixed and large amount of compute needed for each denoising step. In this work, we revisit the conventional static paradigm that allocates a fixed compute budget per denoising iteration and propose a dynamic strategy instead. Our simple and sample-efficient framework enables pre-trained DiT models to be converted into flexible ones -- dubbed FlexiDiT -- allowing them to process inputs at varying compute budgets. We demonstrate how a single flexible model can generate images without any drop in quality, while reducing the required FLOPs by more than 40% compared to their static counterparts, for both class-conditioned and text-conditioned image generation. Our method is general and agnostic to input and conditioning modalities. We show how our approach can be readily extended for video generation, where FlexiDiT models generate samples with up to 75% less compute without compromising performance.
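The abstract's core idea is to replace a fixed per-step compute budget with a dynamic one inside the denoising loop. The sketch below is a minimal, hypothetical illustration of where such a schedule would enter a sampling loop; the FlexibleDenoiser class, the patch-size mechanism for cheapening a forward pass, the budget_schedule, and the update rule are all illustrative assumptions and not the paper's actual method or code.

```python
# Hypothetical sketch: a denoising loop where each step gets its own compute
# budget. Cost is varied here by pooling pixels into larger "patches" so that
# cheap steps process fewer tokens; this is an assumed mechanism for
# illustration only, not the FlexiDiT implementation.

import torch
import torch.nn as nn


class FlexibleDenoiser(nn.Module):
    """Toy stand-in for a flexible DiT: per-step cost is set by patch size."""

    def __init__(self, channels: int = 4, hidden: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(channels, hidden), nn.GELU(), nn.Linear(hidden, channels)
        )

    def forward(self, x: torch.Tensor, t: int, patch: int) -> torch.Tensor:
        # Larger patches -> fewer tokens -> fewer FLOPs for this step.
        # (t is ignored in this toy model.)
        b, c, h, w = x.shape
        tokens = x.unfold(2, patch, patch).unfold(3, patch, patch)      # (b, c, h/p, w/p, p, p)
        tokens = tokens.mean(dim=(-1, -2)).flatten(2).transpose(1, 2)   # (b, n_tokens, c)
        out = self.backbone(tokens)                                     # per-token noise estimate
        # Toy "unpatchify": broadcast token outputs back to pixel resolution.
        out = out.transpose(1, 2).reshape(b, c, h // patch, w // patch)
        return out.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)


def budget_schedule(num_steps: int) -> list[int]:
    """Illustrative schedule: cheap (large-patch) passes early, full compute late."""
    return [4 if i < num_steps // 2 else 2 for i in range(num_steps)]


@torch.no_grad()
def sample(model: FlexibleDenoiser, shape=(1, 4, 32, 32), num_steps: int = 10):
    x = torch.randn(shape)                        # start from pure noise
    for i, patch in enumerate(budget_schedule(num_steps)):
        eps = model(x, t=i, patch=patch)          # cheaper forward pass when patch is large
        x = x - eps / num_steps                   # placeholder update, not a real solver
    return x


if __name__ == "__main__":
    print(sample(FlexibleDenoiser()).shape)       # torch.Size([1, 4, 32, 32])
```

The schedule above (cheap early steps, full compute late) is an arbitrary choice made for the example; how FlexiDiT actually trades compute per step and how budgets are selected is specified in the paper itself.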
