DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers
February 19, 2026
Authors: Dahye Kim, Deepti Ghadiyaram, Raghudeep Gadde
cs.AI
Abstract
Diffusion Transformers (DiTs) have achieved state-of-the-art performance in image and video generation, but their success comes at the cost of heavy computation. This inefficiency is largely due to the fixed tokenization process, which uses constant-sized patches throughout the entire denoising phase, regardless of the content's complexity. We propose dynamic tokenization, an efficient test-time strategy that varies patch sizes based on content complexity and the denoising timestep. Our key insight is that early timesteps only require coarser patches to model global structure, while later iterations demand finer (smaller-sized) patches to refine local details. During inference, our method dynamically reallocates patch sizes across denoising steps for image and video generation, substantially reducing cost while preserving perceptual generation quality. Extensive experiments demonstrate the effectiveness of our approach: it achieves up to 3.52× and 3.2× speedups on FLUX-1.Dev and Wan 2.1, respectively, without compromising generation quality or prompt adherence.
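The coarse-to-fine idea in the abstract can be illustrated with a minimal sketch: a schedule that maps the denoising timestep to a patch size, so early (noisy) steps tokenize the latent into large patches and late steps into fine ones. The function names, thresholds, and patch sizes below are hypothetical illustrations, not taken from the paper; the token count shows why coarser patches cut cost, since attention scales with the (squared) number of tokens.

```python
# Hypothetical sketch of a coarse-to-fine patch-size schedule.
# Early (noisy) timesteps use large patches to model global structure;
# later timesteps use small patches to refine local detail.
# Thresholds and patch sizes are illustrative, not from the paper.

def patch_size_for_timestep(t: float, schedule=((0.5, 4), (0.2, 2))) -> int:
    """Return a DiT patch size for normalized timestep t in [0, 1],
    where t=1 is pure noise and t=0 is the clean sample.

    `schedule` lists (threshold, patch_size) pairs, coarsest first:
    the first pair whose threshold t meets or exceeds wins; otherwise
    the finest size (1) is used near the end of denoising.
    """
    for threshold, size in schedule:
        if t >= threshold:
            return size
    return 1  # finest patches for the final refinement steps

def tokens_per_step(latent_side: int, t: float) -> int:
    """Number of transformer tokens for a square latent of side latent_side."""
    p = patch_size_for_timestep(t)
    return (latent_side // p) ** 2

# Coarse patches early mean far fewer tokens per forward pass:
print(tokens_per_step(64, 0.9))   # early step: patch 4 -> 256 tokens
print(tokens_per_step(64, 0.05))  # late step:  patch 1 -> 4096 tokens
```

Under this toy schedule, a 64×64 latent costs 256 tokens per early step versus 4096 per late step, which is where the reported speedups would come from: most denoising steps run on a much shorter token sequence.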