RT-Lynx: Properly Exploiting GEMM Sparsity for Diffusion Models

Abstract

Diffusion Transformers (DiT) achieve strong performance in image generation but incur substantial inference costs. Prior work has reduced this cost through quantization and distillation, whereas semi-structured sparsity, which can nearly halve FLOPs, remains underexplored. A key reason is that most existing approaches sparsify weights, and pruning 50% of the weights removes critical model capacity and degrades generation quality. We show, however, that DiT activations are intrinsically sparse and far more robust to N:M semi-structured sparsification than weights are. Motivated by this observation, we advocate a paradigm shift from weight sparsification to activation sparsification. We propose RT-Lynx, which applies N:M sparsification to activations and incorporates error-compensation techniques to mitigate accuracy loss. We further implement highly optimized CUDA kernels tailored to this setting, achieving an average 1.55x speedup in linear layers. Extensive experiments across multiple diffusion models demonstrate that our method preserves the generation quality of the original models while substantially accelerating inference.
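
To make the N:M constraint concrete, the following is a minimal sketch of 2:4 sparsification applied to one activation vector. The abstract does not state how RT-Lynx selects which entries to keep, nor the values of N and M, so the magnitude-based selection rule, the choice N = 2 and M = 4, and the function name `nm_sparsify` are all assumptions made for illustration. The error-compensation step and the CUDA kernels are not modeled here.

```python
def nm_sparsify(row, n=2, m=4):
    """Keep the n largest-magnitude values in each consecutive group of m.

    Illustrative only: magnitude-based selection is an assumption, not
    necessarily the rule used by RT-Lynx.
    """
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    out = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Positions of the n largest |x| in this group; sorted() is stable,
        # so ties are resolved in favor of the earlier position.
        keep = set(sorted(range(m), key=lambda i: -abs(group[i]))[:n])
        out.extend(x if i in keep else 0.0 for i, x in enumerate(group))
    return out


# Hypothetical activation values, chosen so that a few entries dominate.
activation = [0.1, -2.0, 0.05, 1.5, 0.0, 0.3, -0.2, 0.01]
sparse_activation = nm_sparsify(activation)
print(sparse_activation)
```

Every group of four consecutive entries in the result holds at most two nonzeros, so a GEMM over the sparsified activation needs at most half of the original multiply-accumulates. This is the source of the "nearly halve FLOPs" figure, provided the hardware and kernel can skip the zeroed positions.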