ChatPaper.ai

Dynamic Chunking Diffusion Transformer

March 6, 2026
Authors: Akash Haridas, Utkarsh Saxena, Parsa Ashrafi Fashi, Mehdi Rezagholizadeh, Vikram Appia, Emad Barsoum
cs.AI

Abstract

Diffusion Transformers process images as fixed-length sequences of tokens produced by a static patchify operation. While effective, this design spends uniform compute on low- and high-information regions alike, ignoring that images contain regions of varying detail and that the denoising process progresses from coarse structure at early timesteps to fine detail at late timesteps. We introduce the Dynamic Chunking Diffusion Transformer (DC-DiT), which augments the DiT backbone with a learned encoder-router-decoder scaffold that adaptively compresses the 2D input into a shorter token sequence in a data-dependent manner, using a chunking mechanism learned end-to-end with diffusion training. The mechanism learns to compress uniform background regions into fewer tokens and detail-rich regions into more tokens, with meaningful visual segmentations emerging without explicit supervision. It also learns to adapt its compression across diffusion timesteps, using fewer tokens at noisy stages and more tokens as fine details emerge. On class-conditional ImageNet 256×256, DC-DiT consistently improves FID and Inception Score over both parameter-matched and FLOP-matched DiT baselines at 4× and 16× compression, showing this is a promising technique with potential further applications to pixel-space, video, and 3D generation. Beyond accuracy, DC-DiT is practical: it can be upcycled from pretrained DiT checkpoints with minimal post-training compute (up to 8× fewer training steps) and composes with other dynamic computation methods to further reduce generation FLOPs.
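To make the encoder-router-decoder idea concrete, here is a minimal NumPy sketch of adaptive token compression. The abstract does not specify the router's exact mechanism, so this sketch assumes a simple score-and-select scheme (keep the top-k tokens by a per-token score, then scatter them back to the full grid); the function names `route_tokens` and `scatter_back` and the top-k rule are illustrative assumptions, not the paper's method.

```python
import numpy as np

def route_tokens(tokens, scores, keep_ratio):
    """Hypothetical router: keep the top-scoring fraction of patch tokens.

    tokens: (n, d) array of patch embeddings
    scores: (n,) per-token importance scores (in the paper these would be learned)
    keep_ratio: fraction of tokens to keep, e.g. 0.25 for 4x compression
    """
    n = tokens.shape[0]
    k = max(1, int(round(n * keep_ratio)))
    keep_idx = np.argsort(scores)[::-1][:k]   # indices of the k highest scores
    keep_idx = np.sort(keep_idx)              # preserve spatial order
    return tokens[keep_idx], keep_idx

def scatter_back(compressed, keep_idx, n, fill_value=0.0):
    """Hypothetical decoder step: place compressed tokens back on the full grid,
    filling dropped positions with a constant (a real decoder would reconstruct them)."""
    d = compressed.shape[1]
    out = np.full((n, d), fill_value)
    out[keep_idx] = compressed
    return out

# Toy example: 16 patch tokens of dimension 4, compressed 4x to 4 tokens.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 4))
scores = rng.uniform(size=16)

compressed, idx = route_tokens(tokens, scores, keep_ratio=0.25)
restored = scatter_back(compressed, idx, n=16)
print(compressed.shape)  # (4, 4): the shorter sequence the backbone would process
```

A timestep-adaptive variant, as the abstract describes, would make `keep_ratio` a function of the diffusion timestep, small at noisy early steps and larger as fine details emerge.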
May 8, 2026