EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs
March 5, 2024
Authors: Hanlin Tang, Yifu Sun, Decheng Wu, Kai Liu, Jianchen Zhu, Zhanhui Kang
cs.AI
Abstract
Large language models (LLMs) have proven to be far superior to conventional
methods in a variety of tasks. However, their expensive computation and high
memory requirements are prohibitive for deployment. Model quantization is an
effective method for reducing this overhead. The problem is that in most
previous works, the quantized model is calibrated with a few samples from the
training data, which might affect the generalization of the quantized LLMs to
unseen cases and tasks. Hence in this work, we explore an important question:
can we design a data-independent quantization method for LLMs that guarantees
their generalization performance? We propose EasyQuant, a training-free and
data-independent weight-only quantization algorithm for LLMs. Our observation
is that two factors, outliers in the weights and the quantization ranges, are
essential for reducing the quantization error. Therefore, in EasyQuant, we
leave the outliers (less than 1% of the weights) unchanged and optimize the
quantization range to reduce the reconstruction error. With these methods, we
surprisingly find that EasyQuant achieves performance comparable to the
original model. Since EasyQuant does not depend on any training data, the
generalization performance of the quantized LLMs is safely guaranteed.
Moreover, EasyQuant can be implemented in parallel, so the quantized model can
be obtained in a few minutes even for LLMs with over 100B parameters. To the
best of our knowledge, ours is the first work to achieve almost lossless
quantization performance for LLMs under a data-independent setting, and our
algorithm runs over 10 times faster than data-dependent methods.
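
As a rough illustration of the idea the abstract describes, the following is a
minimal sketch (not the authors' implementation) of data-free weight-only
quantization that keeps the largest ~1% of weights per output channel in full
precision and tunes per-channel quantization scales by gradient descent on the
weight reconstruction error. The function name easyquant_sketch, the use of a
straight-through estimator, and all hyperparameters are assumptions made for
illustration only.

```python
import torch

def easyquant_sketch(w, n_bits=4, outlier_frac=0.01, iters=100, lr=1e-3):
    """Hypothetical sketch of EasyQuant-style data-free quantization:
    isolate outliers, then optimize the quantization range (scale) to
    minimize reconstruction error on the remaining weights."""
    w = w.float()
    out_ch, in_ch = w.shape

    # 1) Keep the largest ~outlier_frac weights per output channel
    #    in full precision (they are never quantized).
    k = max(1, int(outlier_frac * in_ch))
    thresh = w.abs().topk(k, dim=1).values[:, -1:]   # per-row cutoff
    outlier_mask = w.abs() >= thresh
    w_body = torch.where(outlier_mask, torch.zeros_like(w), w)

    # 2) Optimize a per-channel scale by gradient descent on the
    #    reconstruction error of the non-outlier weights.
    qmax = 2 ** (n_bits - 1) - 1                     # e.g. 7 for 4 bits
    scale = (w_body.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=1e-8)
    scale = scale.clone().requires_grad_(True)
    opt = torch.optim.Adam([scale], lr=lr)
    for _ in range(iters):
        q = (w_body / scale).clamp(-qmax - 1, qmax)
        # Straight-through estimator: forward uses round(), gradients
        # flow through the unrounded value to the scale (assumption).
        q = q + (q.round() - q).detach()
        loss = ((q * scale - w_body) ** 2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()

    # 3) Reconstruct: dequantized body plus full-precision outliers.
    with torch.no_grad():
        q = (w_body / scale).clamp(-qmax - 1, qmax).round()
        w_hat = q * scale
        w_hat[outlier_mask] = w[outlier_mask]
    return w_hat

# Usage: measure the reconstruction error on a random weight matrix.
w = torch.randn(1024, 1024)
w_hat = easyquant_sketch(w)
print((w_hat - w).pow(2).mean().item())
```

The straight-through trick (q + (q.round() - q).detach()) is one common way to
make the rounding step differentiable so the scale can be optimized directly
against the reconstruction error; and since each weight matrix is handled
independently of any calibration data, such a procedure parallelizes trivially
across layers, consistent with the minutes-scale runtime the abstract reports.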