
How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study

April 22, 2024
作者: Wei Huang, Xudong Ma, Haotong Qin, Xingyu Zheng, Chengtao Lv, Hong Chen, Jie Luo, Xiaojuan Qi, Xianglong Liu, Michele Magno
cs.AI

Abstract

Meta's LLaMA family has become one of the most powerful open-source Large Language Model (LLM) series. Notably, the recently released LLaMA3 models achieve impressive performance across various benchmarks, thanks to super-large-scale pre-training on over 15T tokens of data. Given the wide application of low-bit quantization for LLMs in resource-limited scenarios, we explore LLaMA3's capabilities when quantized to low bit-widths. This exploration holds the potential to unveil new insights and challenges for low-bit quantization of LLaMA3 and other forthcoming LLMs, especially in addressing the performance degradation problems that arise in LLM compression. Specifically, we evaluate 10 existing post-training quantization and LoRA-finetuning methods on LLaMA3 at 1-8 bits and on diverse datasets to comprehensively reveal LLaMA3's low-bit quantization performance. Our experimental results indicate that LLaMA3 still suffers non-negligible degradation in these scenarios, especially at ultra-low bit-widths. This highlights the significant performance gap at low bit-widths that needs to be bridged in future developments. We expect that this empirical study will prove valuable in advancing future models, pushing LLMs toward lower bit-widths with higher accuracy so that they become practical. Our project is released at https://github.com/Macaronlin/LLaMA3-Quantization, and quantized LLaMA3 models are released at https://huggingface.co/LLMQ.
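To make the setting concrete, the simplest post-training quantization baseline is round-to-nearest (RTN) uniform quantization of the weight matrices. The sketch below is a minimal illustration of that idea (not the paper's evaluated methods): it quantizes a random weight matrix with symmetric per-row scales at several bit-widths and reports the resulting error, which grows sharply at ultra-low bit-widths. The function name `quantize_rtn` and the toy matrix are illustrative assumptions.

```python
import numpy as np

def quantize_rtn(w: np.ndarray, bits: int) -> np.ndarray:
    """Round-to-nearest (RTN) uniform quantization of a weight matrix.

    Uses symmetric per-row (per-output-channel) scales and returns the
    dequantized weights, so the quantization error can be inspected directly.
    """
    qmax = 2 ** (bits - 1) - 1                 # e.g. 7 for 4-bit signed
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)   # guard against all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16)).astype(np.float32)
for bits in (8, 4, 2):
    err = np.abs(w - quantize_rtn(w, bits)).mean()
    print(f"{bits}-bit mean abs error: {err:.4f}")
```

The per-row scale keeps outlier weights in one channel from inflating the error of every other channel; the methods benchmarked in the paper (GPTQ, AWQ, and similar) refine this baseline with calibration data and error compensation.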
