RT-Lynx: Exploiting GEMM Sparsity Correctly for Diffusion Models

Abstract

Diffusion Transformers (DiTs) achieve strong performance in image generation but incur substantial inference costs. While prior work has reduced this cost via quantization and distillation, semi-structured sparsity, which can nearly halve FLOPs, remains underexplored. A key reason is that most existing approaches focus on weight sparsification, and pruning 50% of the weights removes critical model capacity and degrades generation quality. Our study, however, shows that DiT activations are intrinsically sparse and significantly more robust to N:M semi-structured sparsification than weights. Motivated by this observation, we advocate a paradigm shift from weight sparsification to activation sparsification. We propose RT-Lynx, which applies N:M sparsification to activations and incorporates error-compensation techniques to mitigate accuracy loss. We further implement highly optimized CUDA kernels tailored to this setting, achieving an average speedup of up to 1.55x in linear layers. Extensive experiments across multiple diffusion models demonstrate that our method preserves the generation quality of the original models while substantially accelerating inference.
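To make the N:M pattern concrete, the following is a minimal sketch of 2:4 sparsification applied to a single row of activations. It assumes magnitude-based selection (keep the N largest-magnitude entries in every group of M), which is a common choice but is not stated in the abstract. The function name `nm_sparsify` and the sample values are hypothetical, and the error-compensation step and the CUDA kernels are not modeled here.

```python
def nm_sparsify(values, n=2, m=4):
    """Keep the n largest-magnitude entries in each group of m; zero the rest.

    Assumption: selection is by absolute value. The source does not specify
    the selection criterion used by RT-Lynx.
    """
    if len(values) % m != 0:
        raise ValueError("length must be a multiple of m")
    out = []
    for start in range(0, len(values), m):
        group = values[start:start + m]
        # Indices of the n entries with the largest absolute value in this group.
        ranked = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(ranked[:n])
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


# Hypothetical activation row: two groups of four values.
activations = [0.1, -2.0, 0.05, 1.5, 0.0, 0.3, -0.2, 0.01]
sparse = nm_sparsify(activations)
print(sparse)  # [0.0, -2.0, 0.0, 1.5, 0.0, 0.3, -0.2, 0.0]
```

Every group of four ends up with at most two nonzero entries. This is the structure that lets a sparse GEMM skip about half of the multiply-accumulates. When the activations are already close to sparse, as the abstract argues, the discarded entries carry little magnitude.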