Quantizing Large Language Models for Code Generation: A Differentiated Replication
March 10, 2025
Authors: Alessandro Giagnorio, Antonio Mastropaolo, Saima Afrin, Massimiliano Di Penta, Gabriele Bavota
cs.AI
Abstract
Large Language Models (LLMs) have shown an impressive capability in code
generation and, specifically, in automatically implementing requirements described
in natural language. An LLM's effectiveness generally increases with its size:
the higher the number of trainable parameters, the better its ability to
implement code. However, when it comes to deploying LLM-based code generators,
larger LLMs pose significant challenges related to their memory (and,
consequently, carbon) footprint. A previous work by Wei et al. proposed to
leverage quantization techniques to reduce the memory footprint of LLM-based
code generators without substantially degrading their effectiveness. In short,
they studied LLMs featuring up to 16B parameters, quantizing their precision
from 32-bit floating point down to 8-bit integers and showing the limited impact of such quantization
on code generation performance. Given the fast pace at which LLM capabilities
and quantization techniques are evolving, in this work we present a
differentiated replication of the work by Wei et al. in which we consider (i)
on the one side, more recent and larger code-related LLMs, of up to 34B
parameters; (ii) the latest advancements in model quantization techniques,
which allow pushing the compression to the extreme quantization level of 2 bits
per model parameter; and (iii) different types of calibration datasets to guide
the quantization process, including code-specific ones. Our empirical
evaluation reveals that the new frontier for LLM quantization is 4-bit
precision, resulting in an average memory footprint reduction of 70% compared
to the original model without observing any significant decrease in
performance. Additionally, when the quantization becomes even more extreme (3
and 2 bits), a code-specific calibration dataset helps to limit the loss of
performance.
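To make the setup more concrete, the sketch below illustrates one way to perform calibration-based 4-bit quantization of a code LLM with a code-specific calibration set, using the GPTQ integration of the Hugging Face transformers library. It is only a minimal example under assumed settings (the model identifier, calibration snippets, and output directory are placeholders), not the paper's exact pipeline or configuration.

# Minimal sketch (not necessarily the authors' toolchain): calibration-based
# 4-bit post-training quantization of a code LLM via the Hugging Face
# transformers GPTQ integration (requires the optimum and auto-gptq packages).
# Model name, calibration snippets, and output path are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "codellama/CodeLlama-34b-hf"  # example of a code LLM in the ~34B range
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Code-specific calibration data: in practice, hundreds of source-code samples
# drawn from a code corpus; two toy snippets are shown here.
calibration_code = [
    "def fibonacci(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a",
    "class Stack:\n    def __init__(self):\n        self.items = []\n\n    def push(self, x):\n        self.items.append(x)",
]

quant_config = GPTQConfig(
    bits=4,                    # 4-bit precision: the "new frontier" reported by the study
    dataset=calibration_code,  # calibration samples guide the weight quantization
    tokenizer=tokenizer,
)

# Loading with a GPTQConfig quantizes the weights to 4 bits while running the
# calibration samples through the model layer by layer.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

model.save_pretrained("CodeLlama-34b-gptq-4bit")
tokenizer.save_pretrained("CodeLlama-34b-gptq-4bit")

As a rough back-of-envelope check, storing weights at 4 bits instead of the 16 bits commonly used for the original checkpoints shrinks weight storage to about a quarter of its size, which is broadly consistent with the roughly 70% average memory reduction the study reports, presumably after accounting for non-quantized components and metadata overhead.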