Rethinking Cross-Layer Information Routing in Diffusion Transformers

Abstract

Diffusion Transformers (DiTs) have become the de facto backbone of modern visual generation, and nearly every major axis of their design (tokenization, attention, conditioning, objectives, and latent autoencoders) has been extensively revisited. The residual stream that governs how information accumulates across layers, however, is still inherited directly from the original Transformer. In this paper, we present a systematic empirical analysis of cross-layer information flow in DiTs, jointly along depth and denoising timestep, and identify three concrete symptoms of traditional residual addition: monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. Motivated by this diagnosis, we propose Diffusion-Adaptive Routing (DAR), a drop-in residual replacement that performs learnable, timestep-adaptive, and non-incremental aggregation over the history of sublayer outputs. DAR is also compatible with many modern Transformer enhancement methods, such as REPA. On ImageNet 256×256, DAR improves SiT-XL/2 by 2.11 FID (from 9.67 to 7.56) and matches the baseline's converged quality with 8.75× fewer training iterations. Stacked on top of REPA, it yields a 2× training acceleration in the early stage, suggesting that cross-layer information routing is an underexplored design axis in diffusion modeling, one that operates orthogonally to existing representation-alignment objectives. Beyond pretraining, DAR can also be applied during the fine-tuning stage of large-scale text-to-image (T2I) models, and it preserves high-frequency details during Distribution Matching Distillation.
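
To make the contrast between residual addition and routed aggregation concrete, the following is a minimal, dependency-free sketch. It is not the DAR formulation, which the abstract does not specify. It assumes, purely for illustration, that routing weights come from a softmax over per-entry logits that vary linearly with the timestep. The names `routed_aggregate`, `base_logits`, and `time_slopes` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def residual_sum(history):
    """Standard residual stream: the unweighted sum of the input
    embedding and every sublayer output seen so far."""
    dim = len(history[0])
    return [sum(h[d] for h in history) for d in range(dim)]


def routed_aggregate(history, base_logits, time_slopes, t):
    """Hypothetical timestep-adaptive routing (an assumption, not the
    paper's definition): each history entry i gets the logit
    base_logits[i] + time_slopes[i] * t, and the state is the
    softmax-weighted mix of the whole history, recomputed from scratch
    rather than incrementally added to."""
    logits = [b + s * t for b, s in zip(base_logits, time_slopes)]
    weights = softmax(logits)
    dim = len(history[0])
    mixed = [sum(w * h[d] for w, h in zip(weights, history)) for d in range(dim)]
    return mixed, weights


# Toy history: input embedding followed by two sublayer outputs.
history = [[1.0, 0.0], [1.0, 1.0], [2.0, 1.0]]
summed = residual_sum(history)
mixed, weights = routed_aggregate(history, [0.0, 0.0, 0.0], [-2.0, 0.0, 2.0], t=1.0)
print("residual norm:", norm(summed))
print("routed norm:", norm(mixed), "weights:", weights)
```

Because the softmax weights form a convex combination, the routed state in this toy version can never exceed the largest norm in the history, whereas the plain sum can grow with depth. This bounded-magnitude property belongs to the sketch's softmax choice and should not be read as a claim about how DAR itself behaves.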