Quantizing Large Language Models for Code Generation: A Differentiated Replication

Abstract

Large Language Models (LLMs) have shown an impressive capability in code generation and, specifically, in automatically implementing requirements described in natural language. An LLM's effectiveness generally increases with its size: the higher the number of trainable parameters, the better its ability to implement code. However, when it comes to deploying LLM-based code generators, larger LLMs pose significant challenges related to their memory (and, consequently, carbon) footprint. Previous work by Wei et al. proposed leveraging quantization techniques to reduce the memory footprint of LLM-based code generators without substantially degrading their effectiveness. In short, they studied LLMs featuring up to 16B parameters, quantizing their precision from 32-bit floating point down to 8-bit integers and showing that this had limited impact on code generation performance. Given the fast pace at which LLM capabilities and quantization techniques are evolving, in this work we present a differentiated replication of the work by Wei et al. in which we consider (i) more recent and larger code-related LLMs, of up to 34B parameters; (ii) the latest advancements in model quantization techniques, which allow pushing the compression to the extreme quantization level of 2 bits per model parameter; and (iii) different types of calibration datasets to guide the quantization process, including code-specific ones. Our empirical evaluation reveals that the new frontier for LLM quantization is 4-bit precision, resulting in an average memory footprint reduction of 70% compared to the original model without observing any significant decrease in performance. Additionally, when the quantization becomes even more extreme (3 and 2 bits), a code-specific calibration dataset helps to limit the loss of performance.
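To make the memory arithmetic behind these bit widths concrete, the following is a minimal sketch and not the method used in the study. It assumes a 16-bit baseline for the original model, which the abstract does not state, and it counts weight storage only. The function names are hypothetical. The quantizer is plain round-to-nearest symmetric quantization with one scale per tensor, chosen only because it is the simplest scheme to read; the abstract does not name the techniques actually evaluated.

```python
def weight_memory_bytes(num_params: int, bits_per_param: float) -> float:
    """Bytes needed to store the weights alone (no scales, no activations)."""
    return num_params * bits_per_param / 8


def reduction_vs_baseline(bits_per_param: float, baseline_bits: float = 16.0) -> float:
    """Fractional saving relative to the baseline precision.

    The 16-bit default is an assumption made for this illustration.
    """
    return 1.0 - bits_per_param / baseline_bits


def quantize_symmetric(weights, bits):
    """Round-to-nearest symmetric quantization with a single per-tensor scale."""
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits")
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for 8 bits, 1 for 2 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    params_34b = 34_000_000_000         # largest model size named in the abstract
    for bits in (16, 8, 4, 3, 2):
        gib = weight_memory_bytes(params_34b, bits) / 2**30
        saving = reduction_vs_baseline(bits)
        print(f"{bits:>2} bits: {gib:6.1f} GiB, saving vs 16-bit: {saving:.1%}")

    weights = [0.12, -0.50, 0.33, 0.07, -0.21]   # made-up values
    for bits in (8, 4, 2):
        codes, scale = quantize_symmetric(weights, bits)
        restored = dequantize(codes, scale)
        worst = max(abs(w - r) for w, r in zip(weights, restored))
        print(f"{bits} bits: codes={codes}, max abs error={worst:.4f}")
```

Under the 16-bit assumption, the ideal saving at 4 bits is 75%. Practical schemes also store metadata such as scales and may leave some layers unquantized, so measured savings are lower, which would be consistent with the roughly 70% average reported above. The second loop shows how rounding error grows as the bit width shrinks, which is the kind of degradation calibration data is meant to limit.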
