Revisiting Cross-Layer Information Routing in Diffusion Transformers

Abstract

Diffusion Transformers (DiTs) have become a de facto backbone of modern visual generation, and nearly every major axis of their design (tokenization, attention, conditioning, objectives, and latent autoencoders) has been extensively revisited. The residual stream that governs how information accumulates across layers, however, has been directly inherited from the original Transformer. In this paper, we present a systematic empirical analysis of cross-layer information flow in DiTs, jointly along depth and denoising timestep, and identify three concrete symptoms of traditional residual addition, namely monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. Motivated by this diagnosis, we propose Diffusion-Adaptive Routing (DAR), a drop-in residual replacement that performs learnable, timestep-adaptive, and non-incremental aggregation over the history of sublayer outputs. Moreover, the proposed DAR is compatible with many modern Transformer enhancement methods, such as REPA. On ImageNet 256×256, DAR improves SiT-XL/2 by 2.11 FID (7.56 vs. 9.67) and matches the baseline's converged quality with 8.75× fewer training iterations. Stacked on top of REPA, it yields a 2× training acceleration in the early stage, suggesting cross-layer information routing as an underexplored design axis in diffusion modeling, one that operates orthogonally to existing representation-alignment objectives. Beyond pretraining, DAR can also be applied during the fine-tuning stage of large-scale T2I models and preserves high-frequency details during Distribution Matching Distillation.
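
As a toy illustration of the contrast drawn above, the following sketch compares a running residual sum with a timestep-dependent convex combination over the stored sublayer outputs. It is not the DAR formulation, which the abstract does not specify: the softmax weighting, the linear dependence of the logits on the timestep t, the fixed toy vectors standing in for sublayer outputs, and the names `residual_states`, `routing_weights`, `routed_state`, `base_logits`, and `time_slopes` are all assumptions made for illustration only.

```python
import math


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def residual_states(outputs):
    """Standard residual addition: each state is the running sum of all outputs so far."""
    h = [0.0] * len(outputs[0])
    states = []
    for f in outputs:
        h = [a + b for a, b in zip(h, f)]
        states.append(h)
    return states


def routing_weights(base_logits, time_slopes, t):
    """Hypothetical routing weights: softmax over logits that vary linearly with t."""
    logits = [b + s * t for b, s in zip(base_logits, time_slopes)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def routed_state(history, base_logits, time_slopes, t):
    """Non-incremental aggregation: a convex combination of the whole output history."""
    w = routing_weights(base_logits, time_slopes, t)
    dim = len(history[0])
    return [sum(w_i * f[d] for w_i, f in zip(w, history)) for d in range(dim)]


# Toy stand-ins for three sublayer outputs (assumed values, not model activations).
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Assumed parameters; in a real model these would be learned.
base_logits = [0.0, 0.0, 0.0]
time_slopes = [2.0, 0.0, -2.0]

# The running sum grows with depth here because the toy outputs never cancel.
residual_norms = [norm(h) for h in residual_states(outputs)]

# The routed state mixes the same history differently at different timesteps.
early = routed_state(outputs, base_logits, time_slopes, t=0.1)
late = routed_state(outputs, base_logits, time_slopes, t=0.9)

print("residual norms per layer:", residual_norms)
print("routed state at t=0.1:", early, "norm:", norm(early))
print("routed state at t=0.9:", late, "norm:", norm(late))
```

Because the weights in this sketch are non-negative and sum to one, the routed state can never have a larger norm than the largest stored output, whereas the running sum has no such bound. That is one simple way a non-incremental aggregation could avoid magnitude growth with depth; whether DAR uses a normalized weighting of this kind is not stated in the abstract.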