VQ4DiT: Efficient Post-Training Vector Quantization for Diffusion Transformers
August 30, 2024
Authors: Juncan Deng, Shuaiting Li, Zeyu Wang, Hong Gu, Kedong Xu, Kejie Huang
cs.AI
Abstract
Diffusion Transformer models (DiTs) have transitioned the network architecture from traditional UNets to transformers, demonstrating exceptional capabilities in image generation. Although DiTs have been widely applied to high-definition video generation tasks, their large parameter size hinders inference on edge devices. Vector quantization (VQ) can decompose model weights into a codebook and assignments, allowing extreme weight quantization and significantly reducing memory usage. In this paper, we propose VQ4DiT, a fast post-training vector quantization method for DiTs. We find that traditional VQ methods calibrate only the codebook, leaving the assignments fixed. This causes sub-vectors to be incorrectly mapped to the same assignment, which feeds inconsistent gradients to the codebook and yields suboptimal results. To address this challenge, VQ4DiT computes a candidate assignment set for each weight sub-vector based on Euclidean distance and reconstructs the sub-vector as a weighted average of its candidates. Then, using a zero-data and block-wise calibration method, the optimal assignment is efficiently selected from each set while the codebook is calibrated (each step is sketched in the code examples below). VQ4DiT quantizes a DiT XL/2 model on a single NVIDIA A100 GPU in 20 minutes to 5 hours, depending on the quantization setting. Experiments show that VQ4DiT establishes a new state of the art in the trade-off between model size and performance, quantizing weights to 2-bit precision while retaining acceptable image generation quality.
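
To make the VQ decomposition concrete, here is a minimal sketch of splitting a weight matrix into sub-vectors and fitting a codebook with plain k-means. The codebook size `k = 256` and sub-vector dimension `d = 4` are hypothetical values chosen so that log2(256)/4 = 2 bits are stored per weight; the paper's actual settings and clustering procedure may differ.

```python
import torch

def vq_decompose(weight: torch.Tensor, k: int = 256, d: int = 4, iters: int = 10):
    """Decompose a weight matrix into a codebook and assignments via k-means.

    k and d are hypothetical: with k = 256 codewords of dimension d = 4,
    each weight costs log2(256) / 4 = 2 bits of assignment storage.
    """
    sub_vectors = weight.reshape(-1, d)  # (N, d) sub-vectors
    # Initialize the codebook with k randomly chosen sub-vectors.
    init = torch.randperm(sub_vectors.shape[0])[:k]
    codebook = sub_vectors[init].clone()
    for _ in range(iters):
        # Assign each sub-vector to its nearest codeword (Euclidean distance).
        assignments = torch.cdist(sub_vectors, codebook).argmin(dim=1)
        # Move each codeword to the mean of the sub-vectors assigned to it.
        for j in range(k):
            mask = assignments == j
            if mask.any():
                codebook[j] = sub_vectors[mask].mean(dim=0)
    return codebook, assignments

weight = torch.randn(1152, 1152)          # hypothetical DiT linear-layer shape
codebook, assignments = vq_decompose(weight)
w_hat = codebook[assignments].reshape(weight.shape)  # reconstructed weights
```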
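
The candidate-assignment step might look like the following: for each sub-vector, keep the `m` nearest codewords as its candidate set and reconstruct the sub-vector as a weighted average of them. The softmax-over-negative-distance weighting and the values `m = 4` and `tau = 1.0` are illustrative stand-ins; VQ4DiT calibrates the ratios so that each sub-vector eventually settles on a single optimal assignment.

```python
import torch

def candidate_reconstruct(sub_vectors, codebook, m: int = 4, tau: float = 1.0):
    """Build candidate assignment sets and a weighted-average reconstruction.

    The softmax weighting is an illustrative stand-in; in VQ4DiT the ratios
    are calibrated rather than fixed by distance alone.
    """
    dists = torch.cdist(sub_vectors, codebook)              # (N, K) distances
    cand_d, cand_idx = dists.topk(m, dim=1, largest=False)  # m nearest codewords
    ratios = torch.softmax(-cand_d / tau, dim=1)            # (N, m) weights
    # Weighted average over each sub-vector's candidate codewords.
    recon = (ratios.unsqueeze(-1) * codebook[cand_idx]).sum(dim=1)
    return cand_idx, ratios, recon
```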
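
Finally, a rough sketch of the zero-data, block-wise calibration idea: with no external dataset, the full-precision block's outputs on synthetic inputs serve as targets, and the quantized block's parameters (codebook and candidate ratios) are updated to match them. The input shape (a hypothetical DiT XL/2 token layout), optimizer, and loss below are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def blockwise_calibrate(fp_block, q_block, steps: int = 100, lr: float = 1e-3):
    """Zero-data block-wise calibration sketch: align the quantized block's
    output with the full-precision block's output on synthetic inputs."""
    opt = torch.optim.Adam(q_block.parameters(), lr=lr)
    for _ in range(steps):
        # Hypothetical DiT XL/2 token shape: batch 8, 256 tokens, width 1152.
        x = torch.randn(8, 256, 1152)
        with torch.no_grad():
            target = fp_block(x)        # full-precision output as the target
        loss = F.mse_loss(q_block(x), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
```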