
Sprint: Sparse-Dense Residual Fusion for Efficient Diffusion Transformers

October 24, 2025
Authors: Dogyun Park, Moayed Haji-Ali, Yanyu Li, Willi Menapace, Sergey Tulyakov, Hyunwoo J. Kim, Aliaksandr Siarohin, Anil Kag
cs.AI

Abstract

Diffusion Transformers (DiTs) deliver state-of-the-art generative performance, but their quadratic training cost with sequence length makes large-scale pretraining prohibitively expensive. Token dropping can reduce training cost, yet naïve strategies degrade representations, and existing methods are either parameter-heavy or fail at high drop ratios. We present SPRINT, Sparse-Dense Residual Fusion for Efficient Diffusion Transformers, a simple method that enables aggressive token dropping (up to 75%) while preserving quality. SPRINT leverages the complementary roles of shallow and deep layers: early layers process all tokens to capture local detail, deeper layers operate on a sparse subset to cut computation, and their outputs are fused through residual connections. Training follows a two-stage schedule: long masked pre-training for efficiency, followed by short full-token fine-tuning to close the train-inference gap. On ImageNet-1K 256×256, SPRINT achieves 9.8× training savings with comparable FID/FDD, and at inference, its Path-Drop Guidance (PDG) nearly halves FLOPs while improving quality. These results establish SPRINT as a simple, effective, and general solution for efficient DiT training.
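
The data flow the abstract describes (all-token shallow layers, a sparse deep path, residual fusion of the two) can be pictured with a short forward pass. The snippet below is a minimal, hypothetical PyTorch sketch under those assumptions; the layer counts, the per-batch random 75% drop, the fusion rule, and all names are illustrative and not taken from the paper, which additionally relies on the two-stage masked-pretraining/fine-tuning schedule and Path-Drop Guidance at inference.

```python
# Minimal sketch of sparse-dense residual fusion as described in the abstract.
# All hyperparameters and the exact fusion rule are illustrative assumptions.
import torch
import torch.nn as nn


class SparseDenseFusionSketch(nn.Module):
    def __init__(self, dim=768, heads=12, shallow_depth=2, deep_depth=10, drop_ratio=0.75):
        super().__init__()

        def make_block():
            return nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True, norm_first=True)

        self.shallow = nn.ModuleList([make_block() for _ in range(shallow_depth)])
        self.deep = nn.ModuleList([make_block() for _ in range(deep_depth)])
        self.drop_ratio = drop_ratio

    def forward(self, x):  # x: (batch, num_tokens, dim)
        # 1) Shallow path: every token is processed, capturing local detail.
        for block in self.shallow:
            x = block(x)
        dense = x

        # 2) Randomly keep a subset of tokens for the deep path
        #    (25% of tokens survive at a 75% drop ratio).
        b, n, d = dense.shape
        keep = max(1, int(round(n * (1.0 - self.drop_ratio))))
        idx = torch.rand(b, n, device=dense.device).argsort(dim=1)[:, :keep]
        idx_exp = idx.unsqueeze(-1).expand(-1, -1, d)
        sparse = dense.gather(1, idx_exp)

        # 3) Deep path: attention/MLP cost now scales with the kept tokens only.
        for block in self.deep:
            sparse = block(sparse)

        # 4) Residual fusion: add the deep update onto the dense features at the
        #    kept positions; dropped positions keep the shallow features as-is.
        fused = dense.clone()
        fused.scatter_(1, idx_exp, dense.gather(1, idx_exp) + sparse)
        return fused


if __name__ == "__main__":
    model = SparseDenseFusionSketch()
    tokens = torch.randn(2, 256, 768)  # e.g. a 16x16 grid of latent patches
    print(model(tokens).shape)  # torch.Size([2, 256, 768])
```

In this sketch the quadratic attention cost of the deep blocks falls with the square of the keep ratio, which is where the bulk of the claimed training savings would come from; the short full-token fine-tuning stage mentioned in the abstract is what closes the gap to dense inference.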