RT-Lynx: Exploiting GEMM Sparsity the Right Way in Diffusion Models

Abstract

Diffusion Transformers (DiTs) achieve strong performance in image generation but incur substantial inference cost. Prior work has reduced this cost through quantization and distillation, but semi-structured sparsity, which can nearly halve FLOPs, remains underexplored. A key reason is that most existing approaches focus on weight sparsification, and pruning 50% of the weights can remove critical model capacity and degrade generation quality. Our study shows, however, that DiT activations are intrinsically sparse and significantly more robust to N:M semi-structured sparsification than weights are. Motivated by this observation, we advocate a paradigm shift from weight sparsification to activation sparsification. We propose RT-Lynx, which applies N:M sparsification to activations and incorporates error-compensation techniques to mitigate accuracy loss. We further implement highly optimized CUDA kernels tailored to this setting, achieving an average speedup of up to 1.55x in linear layers. Extensive experiments across multiple diffusion models demonstrate that our method preserves the generation quality of the original models while substantially accelerating inference.
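
To make the N:M pattern concrete, the following is a minimal sketch of 2:4 sparsification applied to one row of activations. It assumes a magnitude-based rule (keep the n largest-magnitude values in every group of m consecutive entries); the abstract does not state the selection rule RT-Lynx uses, and the function name `nm_sparsify` and the sample values are hypothetical. The error-compensation step and the CUDA kernels are not shown.

```python
def nm_sparsify(row, n=2, m=4):
    """Keep the n largest-magnitude values in each group of m; zero the rest.

    Assumption: magnitude-based selection, chosen here for illustration only.
    """
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    out = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Indices of the n largest-magnitude entries in this group.
        ranked = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(ranked[:n])
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


# Hypothetical activation values, two groups of four.
activations = [0.1, -2.0, 0.0, 1.5, 0.3, 0.2, -0.9, 0.05]
sparse = nm_sparsify(activations)
print(sparse)
```

With n = 2 and m = 4, at most half of the entries in each group survive, which is where the "nearly halve FLOPs" figure comes from: a sparse GEMM can skip the multiply-accumulates for the zeroed positions, while the index metadata and any compensation step add some overhead.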