any4: Learned 4-bit Numeric Representation for LLMs

Abstract

We present any4, a learned 4-bit weight quantization solution for large language models (LLMs) that provides arbitrary numeric representations without requiring preprocessing of weights or activations. any4 yields higher accuracy than other related 4-bit numeric representation types (int4, fp4 and nf4), as evaluated on a range of model sizes, generations and families (Llama 2, Llama 3, Mistral and Mixtral). While any4 does not require preprocessing of weights or activations, it is also competitive with orthogonal techniques that do require such preprocessing (e.g., AWQ and GPTQ). We also experiment with any3 and any2 and show competitiveness at lower bit widths. Additionally, we show that we can calibrate using a single curated diverse sample rather than hundreds of samples from a dataset, as is done in most quantization approaches. We also open-source tinygemm, a latency-optimized GPU matrix multiplication library for LLMs that implements any4 using a GPU-efficient lookup table strategy, along with other common quantization methods. Our code is available at https://github.com/facebookresearch/any4.
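To make the lookup-table idea concrete, the following is a minimal NumPy sketch of a 4-bit arbitrary numeric representation. It is an illustration under stated assumptions, not the any4 algorithm or the tinygemm kernel: the per-row 16-entry table, the plain unweighted 1-D k-means used to fit it, and the helper names `learn_lut`, `quantize` and `dequantize` are all hypothetical choices made for this example, and no calibration data is used.

```python
import numpy as np


def learn_lut(row, n_levels=16, n_iters=20):
    """Fit n_levels values to one weight row with plain 1-D k-means (assumed here)."""
    # Initialise the table at evenly spaced quantiles of the row.
    lut = np.quantile(row, np.linspace(0.0, 1.0, n_levels))
    for _ in range(n_iters):
        # Assign every weight to its nearest table entry.
        codes = np.abs(row[:, None] - lut[None, :]).argmin(axis=1)
        # Move each table entry to the mean of the weights assigned to it.
        for k in range(n_levels):
            members = row[codes == k]
            if members.size:
                lut[k] = members.mean()
    # Final assignment: one code in [0, 15] per weight, i.e. 4 bits of information.
    codes = np.abs(row[:, None] - lut[None, :]).argmin(axis=1).astype(np.uint8)
    return lut, codes


def quantize(weights):
    """Return one 16-entry table per row and the matrix of 4-bit codes."""
    luts, codes = zip(*(learn_lut(row) for row in weights))
    return np.stack(luts), np.stack(codes)


def dequantize(luts, codes):
    """Reconstruct weights by indexing each row's table with that row's codes."""
    return np.take_along_axis(luts, codes.astype(np.int64), axis=1)


rng = np.random.default_rng(0)
W = rng.standard_normal((8, 256)).astype(np.float32)  # stand-in weight matrix
luts, codes = quantize(W)
W_hat = dequantize(luts, codes)
mse = float(np.mean((W - W_hat) ** 2))
```

Because the table entries are free values rather than a fixed int4, fp4 or nf4 grid, each row can place its 16 levels wherever its weights concentrate. A production kernel would additionally pack two codes per byte and perform the table lookup on the GPU, which this sketch leaves out.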
