
Dynamic Chunking Diffusion Transformer

March 6, 2026
作者: Akash Haridas, Utkarsh Saxena, Parsa Ashrafi Fashi, Mehdi Rezagholizadeh, Vikram Appia, Emad Barsoum
cs.AI

Abstract

Diffusion Transformers process images as fixed-length sequences of tokens produced by a static patchify operation. While effective, this design spends uniform compute on low- and high-information regions alike, ignoring that images contain regions of varying detail and that the denoising process progresses from coarse structure at early timesteps to fine detail at late timesteps. We introduce the Dynamic Chunking Diffusion Transformer (DC-DiT), which augments the DiT backbone with a learned encoder-router-decoder scaffold that adaptively compresses the 2D input into a shorter token sequence in a data-dependent manner using a chunking mechanism learned end-to-end with diffusion training. The mechanism learns to compress uniform background regions into fewer tokens and detail-rich regions into more tokens, with meaningful visual segmentations emerging without explicit supervision. Furthermore, it also learns to adapt its compression across diffusion timesteps, using fewer tokens at noisy stages and more tokens as fine details emerge. On class-conditional ImageNet 256×256, DC-DiT consistently improves FID and Inception Score over both parameter-matched and FLOP-matched DiT baselines across 4× and 16× compression, showing this is a promising technique with potential further applications to pixel-space, video and 3D generation. Beyond accuracy, DC-DiT is practical: it can be upcycled from pretrained DiT checkpoints with minimal post-training compute (up to 8× fewer training steps) and composes with other dynamic computation methods to further reduce generation FLOPs.
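The data-dependent, timestep-adaptive chunking described above can be illustrated with a toy greedy chunker. This is a hypothetical sketch, not the paper's learned encoder-router-decoder: the boundary scores would come from a learned router in DC-DiT, and the threshold schedule over timesteps is an assumption chosen here purely for illustration.

```python
import numpy as np

def dynamic_chunk(tokens, boundary_scores, timestep, t_max=1000):
    """Greedy illustration of data-dependent chunking (hypothetical sketch).

    tokens:          (N, D) array of patch embeddings.
    boundary_scores: (N,) array in [0, 1]; in DC-DiT these would be produced
                     by the learned router, here they are given as input.
    timestep:        diffusion timestep; noisier (larger) timesteps raise the
                     boundary threshold, so fewer chunks survive -- mimicking
                     the paper's observation of coarser compression at noisy
                     stages. The linear schedule below is assumed.
    Returns an (M, D) array of mean-pooled chunks with M <= N.
    """
    threshold = 0.5 + 0.4 * (timestep / t_max)  # assumed schedule
    chunks, current = [], [tokens[0]]
    for tok, score in zip(tokens[1:], boundary_scores[1:]):
        if score > threshold:
            # Strong boundary: close the current chunk, start a new one.
            chunks.append(np.mean(current, axis=0))
            current = [tok]
        else:
            # Low-information token: merge into the current chunk.
            current.append(tok)
    chunks.append(np.mean(current, axis=0))
    return np.stack(chunks)
```

With alternating high/low boundary scores, a clean (early-timestep) pass keeps most boundaries, while a noisy (late-timestep, high threshold) pass collapses the sequence into a single chunk, giving a shorter token sequence for the transformer backbone.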