RT-Lynx: Applying GEMM Sparsity to Diffusion Models the Right Way

Abstract

Diffusion Transformers (DiTs) achieve strong performance in image generation but incur substantial inference costs. Prior work has reduced this cost through quantization and distillation, yet semi-structured sparsity, which can nearly halve FLOPs, remains underexplored. A key reason is that most existing approaches focus on weight sparsification, and pruning 50% of the weights can remove critical model capacity and degrade generation quality. Our study shows, however, that DiT activations are intrinsically sparse and far more robust than weights to N:M semi-structured sparsification. Motivated by this observation, we advocate a paradigm shift from weight sparsification to activation sparsification. We propose RT-Lynx, which applies N:M sparsification to activations and incorporates error-compensation techniques to mitigate accuracy loss. We further implement highly optimized CUDA kernels tailored to this setting, achieving an average speedup of 1.55x in linear layers. Extensive experiments across multiple diffusion models demonstrate that our method preserves the generation quality of the original models while substantially accelerating inference.
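To make the N:M pattern concrete, the following is a minimal pure-Python sketch of 2:4 sparsification applied to an activation vector. It rests on assumptions not stated in the abstract: entries are selected by magnitude, and the names `sparsify_n_m` and `sparse_dot` are hypothetical helpers introduced only for illustration. It does not reproduce RT-Lynx's error compensation or its CUDA kernels.

```python
def sparsify_n_m(values, n=2, m=4):
    """Keep the n largest-magnitude entries in each group of m; zero the rest.

    Magnitude-based selection is an assumption made for this sketch.
    """
    if len(values) % m != 0:
        raise ValueError("length must be a multiple of m")
    out = []
    for start in range(0, len(values), m):
        group = values[start:start + m]
        # Indices of the n largest magnitudes (ties go to the lower index,
        # because sorted() is stable).
        keep = set(sorted(range(m), key=lambda i: -abs(group[i]))[:n])
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


def sparse_dot(sparse_x, w):
    """Dot product that skips zeroed activations.

    Returns (result, number_of_multiplications_performed).
    """
    total, mults = 0.0, 0
    for x, wi in zip(sparse_x, w):
        if x != 0.0:
            total += x * wi
            mults += 1
    return total, mults


activations = [0.1, -2.0, 0.05, 1.5, 0.0, 0.3, -0.2, 0.01]
weights = [1.0] * len(activations)

sparse = sparsify_n_m(activations)       # [0.0, -2.0, 0.0, 1.5, 0.0, 0.3, -0.2, 0.0]
result, mults = sparse_dot(sparse, weights)
print(sparse, result, mults)
```

With a 2:4 pattern, at most two of every four activations are nonzero, so the dot product here performs 4 multiplications instead of 8. This is the sense in which semi-structured sparsity can nearly halve FLOPs; realizing that saving as wall-clock speedup requires hardware and kernel support, which this sketch does not model.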