RT-Lynx: Putting the GEMM Sparsity In a Right Way for Diffusion Models

May 26, 2026
Authors: Xing Cong, Hanlin Tang, Kan Liu, Lan Tao, Lin Qu, Chenhao Xie
cs.AI

Abstract

Diffusion Transformers (DiTs) achieve strong performance in image generation but incur substantial inference costs. While prior work has reduced this cost via quantization and distillation, semi-structured sparsity, which can nearly halve FLOPs, remains underexplored. A key reason is that most existing approaches focus on weight sparsification, and pruning 50% of the weights can remove critical model capacity and degrade generation quality. Our study, however, shows that DiT activations are intrinsically sparse and significantly more robust to N:M semi-structured sparsification than weights. Motivated by this observation, we advocate a paradigm shift from weight sparsification to activation sparsification. We propose RT-Lynx, which applies N:M sparsification to activations and incorporates error-compensation techniques to mitigate accuracy loss. We further implement highly optimized CUDA kernels tailored to this setting, achieving an average speedup of up to 1.55x in linear layers. Extensive experiments across multiple diffusion models demonstrate that our method preserves the generation quality of the original models while substantially accelerating inference.
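
To make the N:M pattern concrete, the following is a minimal pure-Python sketch of 2:4 sparsification applied to one row of activations. It is an illustration only, not the RT-Lynx implementation: the abstract does not say how the retained entries are chosen, so keeping the largest-magnitude values in each group is an assumption here, and the names `nm_sparsify` and `sparse_dot` are hypothetical. The error-compensation step and the CUDA kernels are not modeled.

```python
from typing import List, Tuple


def nm_sparsify(row: List[float], n: int = 2, m: int = 4) -> List[float]:
    """Keep the n largest-magnitude values in every group of m; zero the rest.

    Assumption: magnitude-based selection. The source does not specify
    the selection rule used by RT-Lynx.
    """
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    out: List[float] = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Indices of the n entries with the largest absolute value.
        # sorted() is stable, so ties are resolved in favor of the earlier index.
        keep = set(sorted(range(m), key=lambda i: -abs(group[i]))[:n])
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


def sparse_dot(activations: List[float], weights: List[float]) -> Tuple[float, int]:
    """Dot product that skips zeroed activations; returns (value, multiply count)."""
    total, mults = 0.0, 0
    for a, w in zip(activations, weights):
        if a != 0.0:
            total += a * w
            mults += 1
    return total, mults


x = [0.1, -2.0, 0.05, 1.5, 0.0, 0.3, -0.2, 0.01]
w = [1.0] * 8
x_sparse = nm_sparsify(x)          # at most 2 nonzeros per group of 4
value, mults = sparse_dot(x_sparse, w)
print(x_sparse, value, mults)
```

With a 2:4 pattern, at most half of the activations in each group survive, so the dot product above performs at most half of the dense multiplications. This is the source of the "nearly halve FLOPs" figure; realizing it as wall-clock speedup depends on hardware and kernel support, which this sketch does not address.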