How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study

Abstract

Meta's LLaMA family has become one of the most powerful open-source Large Language Model (LLM) series. Notably, the recently released LLaMA3 models achieve impressive performance through super-large-scale pre-training on over 15T tokens of data. Given the wide application of low-bit quantization for LLMs in resource-limited scenarios, we explore LLaMA3's capabilities when quantized to low bit-widths. This exploration has the potential to reveal new insights and challenges for low-bit quantization of LLaMA3 and other forthcoming LLMs, especially in addressing the performance degradation incurred in LLM compression. Specifically, we evaluate 10 existing post-training quantization and LoRA-finetuning methods on LLaMA3 at 1-8 bits and on diverse datasets to comprehensively reveal LLaMA3's low-bit quantization performance. Our experimental results indicate that LLaMA3 still suffers non-negligible degradation in these scenarios, especially at ultra-low bit-widths. This highlights a significant performance gap at low bit-widths that needs to be bridged in future developments. We expect this empirical study to prove valuable in advancing future models, pushing LLMs to lower bit-widths with higher accuracy so that they remain practical. Our project is released at https://github.com/Macaronlin/LLaMA3-Quantization and the quantized LLaMA3 models are released at https://huggingface.co/LLMQ.

