Sprint: Sparse-Dense Residual Fusion for Efficient Diffusion Transformers

Abstract

Diffusion Transformers (DiTs) deliver state-of-the-art generative performance, but their training cost grows quadratically with sequence length, which makes large-scale pre-training prohibitively expensive. Token dropping can reduce training cost, yet naive strategies degrade representations, and existing methods are either parameter-heavy or fail at high drop ratios. We present SPRINT (Sparse-Dense Residual Fusion for Efficient Diffusion Transformers), a simple method that enables aggressive token dropping (up to 75%) while preserving quality. SPRINT leverages the complementary roles of shallow and deep layers: early layers process all tokens to capture local detail, deeper layers operate on a sparse subset to cut computation, and their outputs are fused through residual connections. Training follows a two-stage schedule: long masked pre-training for efficiency, followed by short full-token fine-tuning to close the train-inference gap. On ImageNet-1K 256x256, SPRINT achieves 9.8x training savings with comparable FID/FDD, and at inference its Path-Drop Guidance (PDG) nearly halves FLOPs while improving quality. These results establish SPRINT as a simple, effective, and general solution for efficient DiT training.
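
The sparse-dense data flow described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The names `toy_block` and `sparse_dense_forward`, the random token selection, the zero fill at dropped positions, and the additive fusion are all assumptions made for illustration, since the abstract only states that shallow layers see all tokens, deeper layers see a sparse subset, and the outputs are fused through residual connections.

```python
import numpy as np


def toy_block(x, w):
    """Hypothetical stand-in for a transformer block: residual map with tanh."""
    return x + np.tanh(x @ w)


def sparse_dense_forward(x, w_shallow, w_deep, keep_ratio=0.25, rng=None):
    """Toy forward pass. x has shape (num_tokens, dim).

    Returns the fused output (same shape as x) and the kept token indices.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_tokens = x.shape[0]

    # Shallow (dense) path: every token is processed.
    dense = toy_block(x, w_shallow)

    # Keep a random subset of tokens (assumption: uniform random selection).
    num_keep = max(1, int(round(num_tokens * keep_ratio)))
    keep_idx = np.sort(rng.choice(num_tokens, size=num_keep, replace=False))

    # Deep (sparse) path: only the kept tokens are processed.
    sparse = toy_block(dense[keep_idx], w_deep)

    # Scatter the sparse output back to full length.
    # Assumption: dropped positions are filled with zeros.
    deep_full = np.zeros_like(dense)
    deep_full[keep_idx] = sparse

    # Residual fusion (assumption: plain addition of the two paths).
    fused = dense + deep_full
    return fused, keep_idx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(64, 8))          # 64 tokens, 8 channels
    w_shallow = rng.normal(size=(8, 8)) * 0.1
    w_deep = rng.normal(size=(8, 8)) * 0.1
    fused, keep_idx = sparse_dense_forward(x, w_shallow, w_deep, 0.25, rng)
    print(fused.shape, len(keep_idx))     # (64, 8) 16
```

With a 75% drop ratio the deep path sees 16 of 64 tokens. Because self-attention compares every token pair, the number of pairwise interactions in those layers falls to 1/16 of the dense case, which is where most of the saving in a sketch like this would come from.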
