EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs

Abstract

Large language models (LLMs) have proven far superior to conventional methods on a wide range of tasks. However, their high computational cost and large memory requirements are prohibitive for deployment. Model quantization is an effective way to reduce this overhead. The problem is that most previous work calibrates the quantized model on a few samples from the training data, which may affect how well the quantized LLM generalizes to unseen cases and tasks. In this work, we therefore explore an important question: can we design a data-independent quantization method for LLMs that guarantees their generalization performance? We propose EasyQuant, a training-free and data-independent weight-only quantization algorithm for LLMs. We observe that two factors, outliers in the weights and the quantization ranges, are essential for reducing the quantization error. EasyQuant therefore leaves the outliers (less than 1% of the weights) unchanged and optimizes the quantization range to reduce the reconstruction error. With these methods, we find, surprisingly, that EasyQuant achieves performance comparable to the original model. Because EasyQuant does not depend on any training data, the generalization performance of the quantized LLM is safely guaranteed. Moreover, EasyQuant can be implemented in parallel, so the quantized model can be obtained in a few minutes even for LLMs with more than 100B parameters. To the best of our knowledge, this is the first work to achieve almost lossless quantization performance for LLMs in a data-independent setting, and our algorithm runs more than 10 times faster than data-dependent methods.
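
As a rough illustration of the two ingredients named in the abstract (keeping outliers unchanged and optimizing the quantization range), the following is a minimal Python sketch, not the authors' implementation. Several choices are assumptions made for illustration only: the function names, the n-sigma rule used to pick outliers, the 4-bit symmetric per-tensor quantizer, and the simple grid search over candidate scales, which stands in for whatever range optimizer the paper actually uses.

```python
import random
import statistics


def quantize_dequantize(weights, scale, bits=4):
    """Symmetric uniform quantization followed by dequantization.

    Uses Python's round(), which rounds ties to even.
    """
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))
    return [max(qmin, min(qmax, round(w / scale))) * scale for w in weights]


def reconstruction_error(weights, scale, bits=4):
    """Sum of squared differences between weights and their quantized values."""
    deq = quantize_dequantize(weights, scale, bits)
    return sum((w - d) ** 2 for w, d in zip(weights, deq))


def find_outliers(weights, n_sigma=3.0):
    """Indices of weights more than n_sigma standard deviations from the mean.

    The n-sigma rule is an assumption of this sketch.
    """
    mean = statistics.fmean(weights)
    std = statistics.pstdev(weights)
    return {i for i, w in enumerate(weights) if abs(w - mean) > n_sigma * std}


def easyquant_sketch(weights, bits=4, n_sigma=3.0, grid=100):
    """Keep outliers in full precision and quantize the rest with a tuned scale.

    Returns (reconstructed_weights, chosen_scale, outlier_indices).
    No calibration data is used: only the weights themselves are inspected.
    """
    outlier_idx = find_outliers(weights, n_sigma)
    normal = [w for i, w in enumerate(weights) if i not in outlier_idx]

    qmax = 2 ** (bits - 1) - 1
    max_abs = max((abs(w) for w in normal), default=0.0)
    base_scale = max_abs / qmax if max_abs > 0 else 1.0

    # Hypothetical range optimizer: try scales between 50% and 100% of the
    # max-abs scale and keep the one with the lowest reconstruction error.
    best_scale = base_scale
    best_err = reconstruction_error(normal, base_scale, bits)
    for k in range(grid):
        scale = base_scale * (0.5 + 0.5 * k / grid)
        err = reconstruction_error(normal, scale, bits)
        if err < best_err:
            best_scale, best_err = scale, err

    # Stitch the result back together: outliers pass through unchanged.
    deq = iter(quantize_dequantize(normal, best_scale, bits))
    out = [w if i in outlier_idx else next(deq) for i, w in enumerate(weights)]
    return out, best_scale, outlier_idx


if __name__ == "__main__":
    random.seed(0)
    # Toy weights: mostly small values plus two large ones.
    toy = [random.gauss(0.0, 0.02) for _ in range(2000)] + [0.5, -0.6]
    recon, scale, outliers = easyquant_sketch(toy)
    err = sum((a - b) ** 2 for a, b in zip(toy, recon))
    print(f"outliers kept: {len(outliers)}, scale: {scale:.5f}, error: {err:.5f}")
```

On a toy weight vector with a few large values, a sketch like this keeps those values in full precision and picks a tighter scale for the remaining weights, which is the intuition for why the reconstruction error drops relative to plain round-to-nearest over the full range.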
