Rethinking Cross-Layer Information Routing in Diffusion Transformers

Abstract

Diffusion Transformers (DiTs) have become a de facto backbone of modern visual generation, and nearly every major axis of their design (tokenization, attention, conditioning, objectives, and latent autoencoders) has been extensively revisited. The residual stream that governs how information accumulates across layers, however, has been inherited directly from the original Transformer. In this paper, we present a systematic empirical analysis of cross-layer information flow in DiTs, jointly along depth and denoising timestep, and identify three concrete symptoms of traditional residual addition: monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. Motivated by this diagnosis, we propose Diffusion-Adaptive Routing (DAR), a drop-in residual replacement that performs learnable, timestep-adaptive, and non-incremental aggregation over the history of sublayer outputs. DAR is also compatible with many modern Transformer enhancement methods, such as REPA. On ImageNet 256×256, DAR improves SiT-XL/2 by 2.11 FID (7.56 vs. 9.67) and matches the baseline's converged quality with 8.75× fewer training iterations. Stacked on top of REPA, it yields a 2× training acceleration in the early stage, suggesting that cross-layer information routing is an underexplored design axis in diffusion modeling, one that operates orthogonally to existing representation-alignment objectives. Beyond pretraining, DAR can also be applied during the fine-tuning stage of large-scale text-to-image (T2I) models, and it preserves high-frequency details during Distribution Matching Distillation.
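
The abstract does not give DAR's parameterization, so the following is only a toy sketch of the general idea and not the authors' method. It assumes a hypothetical gate whose logits are linear in the timestep t (the names `routing_weights`, `biases`, and `slopes` are illustrative), and it uses a scalar-multiply stand-in for each sublayer. It contrasts incremental residual addition, whose output magnitude grows with depth in this toy setting, with a timestep-dependent convex aggregation over the stored history of sublayer outputs.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def routing_weights(bias, slope, t):
    # Hypothetical gate (an assumption): one logit per history entry, linear in t.
    return softmax([b + a * t for b, a in zip(bias, slope)])

def aggregate(history, weights):
    # Weighted sum over all stored outputs, rather than a running sum.
    dim = len(history[0])
    return [sum(w * h[i] for w, h in zip(weights, history)) for i in range(dim)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def residual_forward(x, sublayers):
    # Standard residual stream: each sublayer output is added incrementally.
    for f in sublayers:
        x = [xi + yi for xi, yi in zip(x, f(x))]
    return x

def routed_forward(x, sublayers, biases, slopes, t):
    # Each sublayer reads a timestep-dependent mixture of the whole history.
    history = [x]
    for k, f in enumerate(sublayers):
        w = routing_weights(biases[k], slopes[k], t)
        history.append(f(aggregate(history, w)))
    n = len(sublayers)
    out = aggregate(history, routing_weights(biases[n], slopes[n], t))
    return out, history

# Toy setup: four identical "sublayers" that scale their input by 0.5.
depth = 4
sublayers = [lambda v: [0.5 * vi for vi in v]] * depth
biases = [[0.0] * (k + 1) for k in range(depth + 1)]
slopes = [[float(j) for j in range(k + 1)] for k in range(depth + 1)]
x0 = [1.0, -2.0, 0.5]

res_out = residual_forward(x0, sublayers)
early, hist_early = routed_forward(x0, sublayers, biases, slopes, t=0.0)
late, hist_late = routed_forward(x0, sublayers, biases, slopes, t=1.0)

print("residual norm growth:", norm(res_out) / norm(x0))
print("routed norms at t=0 and t=1:", norm(early), norm(late))
```

Because the routing weights form a convex combination, the routed output's norm can never exceed the largest norm in the history, whereas the plain residual sum in this toy grows by a factor of 1.5 per layer. In a real model the gate parameters would be learned and the sublayers would be attention and MLP blocks.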