Sprint: Sparse-Dense Residual Fusion for Efficient Diffusion Transformers
October 24, 2025
Authors: Dogyun Park, Moayed Haji-Ali, Yanyu Li, Willi Menapace, Sergey Tulyakov, Hyunwoo J. Kim, Aliaksandr Siarohin, Anil Kag
cs.AI
Abstract
Diffusion Transformers (DiTs) deliver state-of-the-art generative performance,
but their quadratic training cost with sequence length makes large-scale
pretraining prohibitively expensive. Token dropping can reduce training cost,
yet naive strategies degrade representations, and existing methods are either
parameter-heavy or fail at high drop ratios. We present SPRINT, Sparse-Dense
Residual Fusion for Efficient Diffusion Transformers, a simple method that
enables aggressive token dropping (up to 75%) while preserving quality. SPRINT
leverages the complementary roles of shallow and deep layers: early layers
process all tokens to capture local detail, deeper layers operate on a sparse
subset to cut computation, and their outputs are fused through residual
connections. Training follows a two-stage schedule: long masked pre-training
for efficiency followed by short full-token fine-tuning to close the
train-inference gap. On ImageNet-1K 256×256, SPRINT achieves 9.8× training
savings with comparable FID/FDD, and at inference, its Path-Drop Guidance (PDG)
nearly halves FLOPs while improving quality. These results establish SPRINT as
a simple, effective, and general solution for efficient DiT training.
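
To make the mechanism concrete, the following is a minimal PyTorch sketch of one way the sparse-dense residual fusion described in the abstract could be wired up; it is an illustration under assumptions, not the authors' implementation. The block counts, the 75% drop ratio, random token selection, and the use of plain TransformerEncoderLayer blocks (with timestep and class conditioning omitted) are all placeholders for the real DiT components.

    # Sketch only: shallow blocks see every token, deep blocks see a sparse
    # random subset, and the deep output is added back residually onto the
    # dense shallow features. Hyperparameters below are illustrative.
    import torch
    import torch.nn as nn

    class SparseDenseResidualSketch(nn.Module):
        def __init__(self, dim=384, heads=6, shallow_depth=2, deep_depth=10,
                     drop_ratio=0.75):
            super().__init__()
            def block():
                return nn.TransformerEncoderLayer(
                    d_model=dim, nhead=heads, batch_first=True, norm_first=True)
            self.shallow = nn.ModuleList([block() for _ in range(shallow_depth)])
            self.deep = nn.ModuleList([block() for _ in range(deep_depth)])
            self.drop_ratio = drop_ratio

        def forward(self, x, drop_tokens=True):
            # Shallow (dense) path: all tokens are processed for local detail.
            for blk in self.shallow:
                x = blk(x)
            dense = x

            idx_exp = None
            if drop_tokens and self.training:
                B, N, D = x.shape
                keep = max(1, int(N * (1.0 - self.drop_ratio)))
                # Randomly keep a sparse subset of tokens per sample.
                idx = torch.rand(B, N, device=x.device).argsort(dim=1)[:, :keep]
                idx_exp = idx.unsqueeze(-1).expand(-1, -1, D)
                x = torch.gather(x, 1, idx_exp)

            # Deep (sparse) path: only the kept tokens are processed.
            for blk in self.deep:
                x = blk(x)

            if idx_exp is not None:
                # Residual fusion: scatter deep features back onto the dense
                # shallow features; dropped positions keep their shallow output.
                out = dense.clone()
                out.scatter_add_(1, idx_exp, x)
                return out
            return dense + x

    # Example usage (shapes are illustrative):
    # model = SparseDenseResidualSketch()
    # tokens = torch.randn(2, 256, 384)      # e.g. 16x16 latent patches
    # out = model(tokens)                    # in training, ~75% of tokens skip the deep path

Calling the model with drop_tokens=False (or in eval mode) routes every token through both paths, which corresponds to the short full-token fine-tuning stage the abstract uses to close the train-inference gap.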