Rethinking Cross-Layer Information Routing in Diffusion Transformers

Abstract

Diffusion Transformers (DiTs) have become a de facto backbone of modern visual generation, and nearly every major axis of their design (tokenization, attention, conditioning, objectives, and latent autoencoders) has been extensively revisited. The residual stream that governs how information accumulates across layers, however, has been inherited directly from the original Transformer. In this paper, we present a systematic empirical analysis of cross-layer information flow in DiTs, jointly along depth and denoising timestep, and identify three concrete symptoms of traditional residual addition: monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. Motivated by this diagnosis, we propose Diffusion-Adaptive Routing (DAR), a drop-in residual replacement that performs learnable, timestep-adaptive, and non-incremental aggregation over the history of sublayer outputs. DAR is also compatible with many modern Transformer enhancement methods, such as REPA. On ImageNet 256×256, DAR lowers the FID of SiT-XL/2 by 2.11 (7.56 vs. 9.67) and matches the baseline's converged quality with 8.75× fewer training iterations. Stacked on top of REPA, it yields a 2× training acceleration in the early stage, suggesting that cross-layer information routing is an underexplored design axis in diffusion modeling, one that operates orthogonally to existing representation-alignment objectives. Beyond pretraining, DAR can also be applied during the fine-tuning stage of large-scale text-to-image (T2I) models, and it preserves high-frequency details during Distribution Matching Distillation.
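
To make the contrast between residual addition and routed aggregation concrete, the following is a minimal toy sketch in pure Python. It is not the DAR formulation from the paper, which the abstract does not specify. It assumes, purely for illustration, that the input to each sublayer is a softmax-weighted (convex) mix of all earlier outputs, with logits of the form `a_i + b_i * t`. The names `routed_input`, `base_logits`, `time_slopes`, and `toy_sublayer` are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def norm(v):
    """Euclidean norm of a vector stored as a list."""
    return math.sqrt(sum(x * x for x in v))


def residual_input(history):
    """Standard residual stream: the plain sum of the embedding and all
    previous sublayer outputs (every weight is fixed to 1)."""
    dim = len(history[0])
    return [sum(h[d] for h in history) for d in range(dim)]


def routed_input(history, base_logits, time_slopes, t):
    """ASSUMED routing rule (illustrative only): a convex combination of
    the history whose weights depend on the timestep t."""
    logits = [a + b * t for a, b in zip(base_logits, time_slopes)]
    weights = softmax(logits)
    dim = len(history[0])
    mixed = [sum(w * h[d] for w, h in zip(weights, history)) for d in range(dim)]
    return mixed, weights


def toy_sublayer(x, rng):
    """Stand-in for an attention or MLP sublayer: a bounded nonlinearity
    plus noise. A real DiT block would go here."""
    return [math.tanh(xi) + rng.gauss(0.0, 1.0) for xi in x]


def run(depth=8, dim=16, t=0.5, seed=0):
    """Record the norm of each sublayer's input under both schemes."""
    rng = random.Random(seed)
    x0 = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    res_hist, routed_hist = [x0], [x0]
    res_norms, routed_norms = [], []
    for _ in range(depth):
        x = residual_input(res_hist)
        res_norms.append(norm(x))
        res_hist.append(toy_sublayer(x, rng))

        k = len(routed_hist)
        base_logits = [0.0] * k                # would be learned
        time_slopes = [i / k for i in range(k)]  # would be learned
        x, _ = routed_input(routed_hist, base_logits, time_slopes, t)
        routed_norms.append(norm(x))
        routed_hist.append(toy_sublayer(x, rng))
    return res_norms, routed_norms


if __name__ == "__main__":
    res_norms, routed_norms = run()
    for layer, (r, d) in enumerate(zip(res_norms, routed_norms)):
        print(f"layer {layer}: residual |x| = {r:.2f}, routed |x| = {d:.2f}")
```

Because the assumed weights are non-negative and sum to one, the norm of the routed input can never exceed the largest norm in the history, whereas the residual sum has no such bound. This is only one way to obtain a non-incremental, timestep-dependent aggregation; the actual parameterization used by DAR may differ.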