MobileQuant: Mobile-friendly Quantization for On-device Language Models

Abstract

Large language models (LLMs) have revolutionized language processing, delivering outstanding results across multiple applications. However, deploying LLMs on edge devices poses several challenges with respect to memory, energy, and compute costs, limiting their widespread use in devices such as mobile phones. A promising solution is to reduce the number of bits used to represent weights and activations. While existing works have found partial success at quantizing LLMs to lower bitwidths, e.g., 4-bit weights, quantizing activations to fewer than 16 bits often leads to large computational overheads due to poor on-device quantization support, or to a considerable accuracy drop. Yet 8-bit activations are very attractive for on-device deployment, as they would enable LLMs to fully exploit mobile-friendly hardware, e.g., Neural Processing Units (NPUs). In this work, we make a first attempt to facilitate the on-device deployment of LLMs using integer-only quantization. We first investigate the limitations of existing quantization methods for on-device deployment, with a special focus on activation quantization. We then address these limitations by introducing a simple post-training quantization method, named MobileQuant, that extends prior work on weight-equivalent transformations by jointly optimizing the weight transformation and activation range parameters in an end-to-end manner. MobileQuant demonstrates superior capabilities over existing methods by 1) achieving near-lossless quantization on a wide range of LLM benchmarks, 2) reducing latency and energy consumption by 20%-50% compared to current on-device quantization strategies, 3) requiring only a limited compute budget, and 4) being compatible with mobile-friendly compute units, e.g., NPUs.
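To make the two ingredients named above concrete, the following is a minimal sketch, not the MobileQuant implementation. It assumes a single linear layer, symmetric 8-bit fake-quantization of activations with a max-abs range, and a hand-picked per-channel scale `s`; the function names and toy values are hypothetical. MobileQuant optimizes the transformation and the activation ranges jointly, which this sketch does not attempt. Weights stay in floating point here; in practice the scaling moves some of the quantization difficulty onto the weights, which are quantized as well.

```python
def quantize(values, max_abs, bits=8):
    """Symmetric fake-quantization: snap to a signed integer grid, then rescale."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8 bits
    scale = max_abs / qmax                  # step size implied by the range parameter
    out = []
    for v in values:
        q = round(v / scale)
        q = max(-qmax, min(qmax, q))        # clip to the representable range
        out.append(q * scale)
    return out


def linear(x, w):
    """y = x @ w for a vector x (length n) and a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def equivalent_transform(x, w, s):
    """Divide input channel i by s[i] and multiply weight row i by s[i].

    In exact arithmetic the layer output is unchanged.
    """
    x_t = [xi / si for xi, si in zip(x, s)]
    w_t = [[wij * si for wij in row] for row, si in zip(w, s)]
    return x_t, w_t


# Toy activation with one outlier channel (hypothetical values).
x = [0.3, -0.2, 16.0]
w = [[0.2, -0.1], [0.4, 0.3], [0.01, -0.02]]
s = [1.0, 1.0, 16.0]                        # hand-picked, not learned

y_ref = linear(x, w)
x_t, w_t = equivalent_transform(x, w, s)
y_t = linear(x_t, w_t)                      # matches y_ref up to rounding

# Quantize activations with a max-abs range, before and after the transform.
y_q_before = linear(quantize(x, max(abs(v) for v in x)), w)
y_q_after = linear(quantize(x_t, max(abs(v) for v in x_t)), w_t)

err_before = max(abs(a - b) for a, b in zip(y_q_before, y_ref))
err_after = max(abs(a - b) for a, b in zip(y_q_after, y_ref))
print(err_before, err_after)
```

With these toy values the outlier channel no longer dictates the quantization step after the transform, so the small channels are represented more finely and the output error is lower.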
