How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study

Abstract

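Meta's LLaMA family has become one of the most powerful open-source large language model (LLM) series. In particular, the recently released LLaMA3 models achieve impressive performance across a variety of tasks thanks to very large-scale pre-training on more than 15T tokens of data. Given the wide application of low-bit quantization for LLMs in resource-limited scenarios, we explore the capabilities of LLaMA3 when it is quantized to low bit-widths. This exploration has the potential to reveal new insights and challenges for low-bit quantization of LLaMA3 and other forthcoming LLMs, especially in addressing the performance degradation that arises in LLM compression. Specifically, we apply 10 existing post-training quantization and LoRA fine-tuning methods to LLaMA3 at 1 to 8 bits and comprehensively evaluate its low-bit quantization performance on diverse datasets. Our experimental results show that LLaMA3 still suffers non-negligible degradation in these scenarios, especially at ultra-low bit-widths. This highlights a significant performance gap at low bit-widths that future work needs to close. We expect this empirical study to prove valuable in advancing future models and in pushing LLMs toward lower bit-widths with higher accuracy so that they are practical to deploy. Our project is released at https://github.com/Macaronlin/LLaMA3-Quantization, and the quantized LLaMA3 models are released at https://huggingface.co/LLMQ.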
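To make the notion of "low bit-width" concrete, the following is a minimal sketch of symmetric round-to-nearest uniform quantization applied to synthetic Gaussian values. It is an illustration only: the abstract does not specify which quantization methods were evaluated, and the function names (`quantize_rtn`, `mean_squared_error`), the per-tensor scaling scheme, and the random data are all assumptions introduced here. The sketch also does not cover the 1-bit case, which would need a different (binarization-style) scheme.

```python
import random


def quantize_rtn(weights, bits):
    """Symmetric uniform round-to-nearest quantization of a list of floats.

    Returns the dequantized values so the error can be measured directly.
    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits")
    qmax = 2 ** (bits - 1) - 1  # largest integer level, e.g. 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)  # nothing to quantize
    scale = max_abs / qmax  # one scale for the whole tensor
    dequantized = []
    for w in weights:
        q = round(w / scale)  # snap to the nearest integer level
        q = max(-qmax, min(qmax, q))  # clamp to the representable range
        dequantized.append(q * scale)
    return dequantized


def mean_squared_error(a, b):
    """Average squared difference between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Synthetic stand-in for a weight tensor (not real model weights).
rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(4096)]

errors = {
    bits: mean_squared_error(weights, quantize_rtn(weights, bits))
    for bits in (8, 4, 3, 2)
}
for bits, err in errors.items():
    print(f"{bits}-bit: MSE = {err:.6f}")
```

In this toy setting the reconstruction error grows as the bit-width shrinks, because fewer integer levels must cover the same value range. This only illustrates why ultra-low bit-widths are harder; it says nothing about the task accuracy reported in the study.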

