Accurate Compression of Text-to-Image Diffusion Models via Vector Quantization

Abstract

Text-to-image diffusion models have emerged as a powerful framework for generating high-quality images from textual prompts. Their success has driven the rapid development of production-grade diffusion models that keep growing in size and already contain billions of parameters. As a result, state-of-the-art text-to-image models are becoming less accessible in practice, especially in resource-limited environments. Post-training quantization (PTQ) tackles this issue by compressing pretrained model weights into lower-bit representations. Recent diffusion quantization techniques rely primarily on uniform scalar quantization, which provides decent performance for models compressed to 4 bits. This work demonstrates that more versatile vector quantization (VQ) may achieve higher compression rates for large-scale text-to-image diffusion models. Specifically, we tailor vector-based PTQ methods to recent billion-scale text-to-image models (SDXL and SDXL-Turbo) and show that diffusion models with 2B+ parameters compressed to around 3 bits using VQ exhibit image quality and textual alignment similar to those of previous 4-bit compression techniques.
