

DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers

February 19, 2026
Authors: Dahye Kim, Deepti Ghadiyaram, Raghudeep Gadde
cs.AI

Abstract

Diffusion Transformers (DiTs) have achieved state-of-the-art performance in image and video generation, but their success comes at the cost of heavy computation. This inefficiency largely stems from the fixed tokenization process, which uses constant-sized patches throughout the entire denoising phase, regardless of the content's complexity. We propose dynamic tokenization, an efficient test-time strategy that varies patch sizes based on content complexity and the denoising timestep. Our key insight is that early timesteps require only coarser patches to model global structure, while later iterations demand finer (smaller) patches to refine local details. During inference, our method dynamically reallocates patch sizes across denoising steps for image and video generation, substantially reducing cost while preserving perceptual generation quality. Extensive experiments demonstrate the effectiveness of our approach: it achieves up to 3.52× and 3.2× speedups on FLUX-1.Dev and Wan 2.1, respectively, without compromising generation quality or prompt adherence.
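To make the coarse-to-fine idea concrete, here is a minimal sketch of a timestep-dependent patch schedule and the resulting token counts. The function names (`patch_size_for_step`, `patchify`), the schedule thresholds, and the latent dimensions are illustrative assumptions, not the paper's actual implementation; the point is only that larger patches at early steps shrink the token sequence, and attention cost with it, quadratically.

```python
# Hypothetical sketch of timestep-dependent patch sizing for a DiT.
# Schedule thresholds and patch sizes are illustrative, not from the paper.
import torch

def patch_size_for_step(step: int, total_steps: int) -> int:
    """Coarse patches early (global structure), fine patches late (local detail)."""
    frac = step / total_steps
    if frac < 0.4:
        return 4  # coarse: 4x4 latent patches -> 16x fewer tokens than 1x1
    elif frac < 0.8:
        return 2  # medium
    return 1      # fine: full token resolution for detail refinement

def patchify(latent: torch.Tensor, p: int) -> torch.Tensor:
    """Fold a (C, H, W) latent into (H/p * W/p) tokens of dimension C*p*p."""
    c, h, w = latent.shape
    tiles = latent.unfold(1, p, p).unfold(2, p, p)       # (C, H/p, W/p, p, p)
    return tiles.permute(1, 2, 0, 3, 4).reshape(-1, c * p * p)

# Token count (and the quadratic attention cost) drops sharply at coarse steps:
latent = torch.randn(16, 64, 64)
for step in (0, 25, 49):
    p = patch_size_for_step(step, total_steps=50)
    print(step, p, patchify(latent, p).shape)  # step 0 -> 256 tokens vs 4096 at step 49
```

In this toy schedule, the first 40% of denoising steps run on 256 tokens instead of 4096, which is where the bulk of the claimed speedup would come from; the actual method additionally conditions the schedule on content complexity, which this sketch omits.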