How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study

Abstract

Meta's LLaMA family has become one of the most powerful open-source Large Language Model (LLM) series. Notably, the recently released LLaMA3 models achieve impressive performance across various domains, thanks to super-large-scale pre-training on over 15T tokens of data. Given the wide application of low-bit quantization for LLMs in resource-limited scenarios, we explore LLaMA3's capabilities when quantized to low bit-widths. This exploration has the potential to reveal new insights and challenges for low-bit quantization of LLaMA3 and other forthcoming LLMs, especially in addressing the performance degradation that arises in LLM compression. Specifically, we evaluate 10 existing post-training quantization and LoRA-finetuning methods on LLaMA3 at 1-8 bits and on diverse datasets to comprehensively reveal LLaMA3's low-bit quantization performance. Our experimental results indicate that LLaMA3 still suffers non-negligible degradation in these scenarios, especially at ultra-low bit-widths. This highlights the significant performance gap at low bit-widths that needs to be bridged in future developments. We expect this empirical study to prove valuable in advancing future models, pushing LLMs to lower bit-widths with higher accuracy so that they are practical. Our project is released at https://github.com/Macaronlin/LLaMA3-Quantization and quantized LLaMA3 models are released at https://huggingface.co/LLMQ.
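To make the bit-width trade-off concrete, the following is a minimal sketch of symmetric round-to-nearest (RTN) quantization applied to synthetic weights. It is a generic illustration and not necessarily any of the ten methods the study evaluates. The function names, the Gaussian weight distribution, and the per-tensor scaling are all assumptions chosen for simplicity, and the weights are random numbers rather than LLaMA3 parameters.

```python
import random


def quantize_rtn(weights, bits):
    """Symmetric per-tensor round-to-nearest quantization followed by dequantization.

    Hypothetical helper for illustration only; real methods typically use
    per-channel or per-group scales and calibration data.
    """
    if bits < 2:
        raise ValueError("this sketch covers 2-8 bits; 1-bit schemes binarize differently")
    qmax = 2 ** (bits - 1) - 1  # largest integer level, e.g. 127 for 8 bits, 1 for 2 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    dequantized = []
    for w in weights:
        q = round(w / scale)            # map to the nearest integer level
        q = max(-qmax, min(qmax, q))    # clamp to the representable range
        dequantized.append(q * scale)   # map back to floating point
    return dequantized


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Synthetic stand-in for one weight tensor (assumption: zero-mean Gaussian).
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]

errors = {
    bits: mean_squared_error(weights, quantize_rtn(weights, bits))
    for bits in (8, 4, 3, 2)
}
for bits, err in errors.items():
    print(f"{bits}-bit RTN: MSE = {err:.3e}")
```

In this toy setting the reconstruction error grows as the bit-width shrinks, which is the basic mechanism behind the accuracy loss the abstract describes. Measuring the effect on an actual model requires evaluating the quantized model on benchmark datasets, as the study does.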
