Quantizing Large Language Models for Code Generation: A Differentiated Replication

Abstract

Large Language Models (LLMs) have shown an impressive capability in code generation and, specifically, in automatically implementing requirements described in natural language. An LLM's effectiveness generally increases with its size: the higher the number of trainable parameters, the better its ability to implement code. However, when it comes to deploying LLM-based code generators, larger LLMs pose significant challenges related to their memory (and, consequently, carbon) footprint. Previous work by Wei et al. proposed leveraging quantization techniques to reduce the memory footprint of LLM-based code generators without substantially degrading their effectiveness. In short, they studied LLMs featuring up to 16B parameters, quantizing their precision from 32-bit floating point down to 8-bit integers and showing that this had a limited impact on code generation performance. Given the fast pace at which LLM capabilities and quantization techniques are evolving, in this work we present a differentiated replication of the work by Wei et al. in which we consider (i) more recent and larger code-related LLMs, of up to 34B parameters; (ii) the latest advancements in model quantization techniques, which allow pushing the compression to the extreme quantization level of 2 bits per model parameter; and (iii) different types of calibration datasets to guide the quantization process, including code-specific ones. Our empirical evaluation reveals that the new frontier for LLM quantization is 4-bit precision, resulting in an average memory footprint reduction of 70% compared to the original model without any significant decrease in performance. Additionally, when the quantization becomes even more extreme (3 and 2 bits), a code-specific calibration dataset helps to limit the loss of performance.
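
To give a rough sense of the memory figures discussed in the abstract, the following minimal sketch estimates the storage needed for model weights at different bit widths. It is only back-of-the-envelope arithmetic, not the measurement procedure used in the study. The helper names (weight_memory_gb, reduction_pct) are hypothetical, and the 16-bit baseline is an assumption: the abstract does not state the precision of the original models.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def reduction_pct(baseline_bits: float, quantized_bits: float) -> float:
    """Ideal percentage reduction in weight storage versus the baseline."""
    return 100.0 * (1.0 - quantized_bits / baseline_bits)


NUM_PARAMS = 34e9    # largest model size named in the abstract
BASELINE_BITS = 16   # ASSUMPTION: baseline precision is not given in the abstract

if __name__ == "__main__":
    for bits in (8, 4, 3, 2):
        size = weight_memory_gb(NUM_PARAMS, bits)
        saved = reduction_pct(BASELINE_BITS, bits)
        print(f"{bits}-bit: {size:.2f} GB, ideal reduction {saved:.1f}%")
```

Under these assumptions, 4-bit weights would ideally occupy one quarter of the 16-bit storage. This idealized figure ignores quantization metadata such as per-group scales, as well as any layers kept at higher precision, so measured savings can differ from it.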
