

EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs

March 5, 2024
作者: Hanlin Tang, Yifu Sun, Decheng Wu, Kai Liu, Jianchen Zhu, Zhanhui Kang
cs.AI

Abstract

Large language models (LLMs) have proven far superior to conventional methods across a wide range of tasks. However, their expensive computation and high memory requirements make them prohibitive to deploy. Model quantization is an effective way to reduce this overhead. The problem is that in most previous works, the quantized model was calibrated with a small number of samples from the training data, which may hurt the generalization of the quantized LLMs to unseen cases and tasks. Hence, in this work we explore an important question: can we design a data-independent quantization method for LLMs that guarantees their generalization performance? We propose EasyQuant, a training-free and data-independent weight-only quantization algorithm for LLMs. Our observation indicates that two factors, outliers in the weights and the quantization ranges, are essential for reducing the quantization error. Therefore, in EasyQuant, we leave the outliers (less than 1%) unchanged and optimize the quantization range to reduce the reconstruction error. With these methods, we surprisingly find that EasyQuant achieves performance comparable to the original model. Since EasyQuant does not depend on any training data, the generalization performance of the quantized LLMs is safely guaranteed. Moreover, EasyQuant can be implemented in parallel, so the quantized model can be obtained in a few minutes even for LLMs over 100B parameters. To the best of our knowledge, this is the first work to achieve almost lossless quantization performance for LLMs in a data-independent setting, and our algorithm runs more than 10 times faster than data-dependent methods.
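
The abstract names the two ingredients of EasyQuant: isolating a small fraction of outlier weights in full precision, and optimizing the quantization range to minimize the weight reconstruction error. The sketch below illustrates that recipe on a single weight row, assuming per-row symmetric quantization. It is a minimal illustration, not the paper's implementation: the grid search over range-shrink factors stands in for the paper's gradient-based range optimization, and the function name, defaults, and thresholds are hypothetical.

```python
import numpy as np

def quantize_row(w, n_bits=4, outlier_frac=0.01, n_grid=100):
    """Data-free weight-only quantization of one weight row (sketch).

    Keeps the largest-magnitude ~1% of weights in full precision and
    grid-searches a symmetric quantization scale that minimizes the
    reconstruction error on the remaining weights. (The paper optimizes
    the range directly; grid search is a simple stand-in.)
    """
    w = np.asarray(w, dtype=np.float64)
    k = max(1, int(outlier_frac * w.size))
    # Indices of the k largest-magnitude weights: these stay unquantized.
    outlier_idx = np.argpartition(np.abs(w), -k)[-k:]
    mask = np.zeros(w.size, dtype=bool)
    mask[outlier_idx] = True
    inliers = w[~mask]

    qmax = 2 ** (n_bits - 1) - 1                # symmetric signed grid
    max_abs = max(np.abs(inliers).max(), 1e-12)  # guard all-zero rows
    best_err, best_scale = np.inf, max_abs / qmax
    # Shrinking the range below max|w| often lowers the overall MSE,
    # because finer resolution on the bulk outweighs clipping the tail.
    for shrink in np.linspace(0.2, 1.0, n_grid):
        scale = shrink * max_abs / qmax
        q = np.clip(np.round(inliers / scale), -qmax - 1, qmax)
        err = np.sum((inliers - q * scale) ** 2)
        if err < best_err:
            best_err, best_scale = err, scale

    w_hat = w.copy()
    q = np.clip(np.round(inliers / best_scale), -qmax - 1, qmax)
    w_hat[~mask] = q * best_scale               # dequantized inliers
    return w_hat, best_scale, outlier_idx       # outliers exact in w_hat
```

Note that no calibration data appears anywhere: the objective is the reconstruction error of the weights themselves. Each row is also independent, so the search parallelizes trivially across channels and layers, which is consistent with the abstract's claim that 100B+ models can be quantized in minutes.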