RT-Lynx: Putting the GEMM Sparsity In a Right Way for Diffusion Models
May 26, 2026
Authors: Xing Cong, Hanlin Tang, Kan Liu, Lan Tao, Lin Qu, Chenhao Xie
cs.AI
Abstract
Diffusion Transformers (DiT) achieve strong performance in image generation but incur substantial inference costs. While prior work has reduced this cost via quantization and distillation, semi-structured sparsity, which can nearly halve FLOPs, remains underexplored. A key reason is that most existing approaches focus on weight sparsification, and pruning 50% of the weights can remove critical model capacity and degrade generation quality. Our study, however, shows that DiT activations are intrinsically sparse and significantly more robust to N:M semi-structured sparsification than weights. Motivated by this observation, we advocate a paradigm shift from weight sparsification to activation sparsification. We propose RT-Lynx, which applies N:M sparsification to activations and incorporates error-compensation techniques to mitigate accuracy loss. We further implement highly optimized CUDA kernels tailored to this setting, achieving an average speedup of 1.55x in linear layers. Extensive experiments across multiple diffusion models demonstrate that our method preserves the generation quality of the original models while substantially accelerating inference.
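To make the idea of N:M activation sparsification concrete, the following is a minimal NumPy sketch, not the RT-Lynx implementation. It assumes a 2:4 pattern and a magnitude-based selection rule (keep the 2 largest-magnitude values in every group of 4 along the feature axis); the abstract does not state the selection rule, and the error-compensation step and the CUDA kernels are not modeled here. The function name `nm_sparsify` and the tensor shapes are hypothetical.

```python
import numpy as np

def nm_sparsify(x, n=2, m=4):
    """Keep the n largest-magnitude entries in each group of m along the last axis.

    Assumption: magnitude-based selection; the source does not specify the rule.
    """
    if x.shape[-1] % m != 0:
        raise ValueError("last dimension must be divisible by m")
    groups = x.reshape(-1, m)
    # Indices of the (m - n) smallest-magnitude entries in each group.
    drop = np.argsort(np.abs(groups), axis=1)[:, : m - n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (groups * mask).reshape(x.shape)

# Hypothetical linear layer: activations (tokens x features) times dense weights.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
w = rng.standard_normal((16, 32))

x_sparse = nm_sparsify(x)  # 2:4 sparse activations
# Relative output error introduced by sparsifying activations (no compensation).
rel_err = np.linalg.norm(x @ w - x_sparse @ w) / np.linalg.norm(x @ w)
print(f"relative output error: {rel_err:.3f}")
```

In this sketch the weights stay dense and only the activation operand of the GEMM is sparsified, which is the shift the abstract argues for. A real speedup would additionally require a kernel that exploits the N:M pattern in hardware, which plain NumPy does not do.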