MobileQuant: Mobile-friendly Quantization for On-device Language Models

Abstract

Large language models (LLMs) have revolutionized language processing, delivering outstanding results across multiple applications. However, deploying LLMs on edge devices poses several challenges with respect to memory, energy, and compute costs, limiting their widespread use in devices such as mobile phones. A promising solution is to reduce the number of bits used to represent weights and activations. While existing works have found partial success at quantizing LLMs to lower bitwidths, e.g. 4-bit weights, quantizing activations to fewer than 16 bits often leads either to large computational overheads, due to poor on-device quantization support, or to a considerable accuracy drop. Yet 8-bit activations are very attractive for on-device deployment, as they would enable LLMs to fully exploit mobile-friendly hardware, e.g. Neural Processing Units (NPUs). In this work, we make a first attempt to facilitate the on-device deployment of LLMs using integer-only quantization. We first investigate the limitations of existing quantization methods for on-device deployment, with a special focus on activation quantization. We then address these limitations by introducing a simple post-training quantization method, named MobileQuant, that extends previous work on weight equivalent transformations by jointly optimizing the weight transformation and activation range parameters in an end-to-end manner. MobileQuant demonstrates superior capabilities over existing methods by 1) achieving near-lossless quantization on a wide range of LLM benchmarks, 2) reducing latency and energy consumption by 20% to 50% compared to current on-device quantization strategies, 3) requiring a limited compute budget, and 4) being compatible with mobile-friendly compute units, e.g. NPUs.
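To make the two ingredients named in the abstract concrete, the following is a minimal sketch of a weight equivalent transformation combined with 8-bit activation quantization on a toy linear layer. It is not the MobileQuant implementation: the function names, the toy data, and the hand-picked scales and activation ranges are all assumptions made for illustration, whereas the abstract describes these parameters as being optimized jointly and end to end.

```python
# Illustrative sketch only, NOT the MobileQuant implementation.
# All names, data, scales, and ranges below are assumptions for this example.

def matmul(x, w):
    """Multiply an (n x k) matrix by a (k x m) matrix, both nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]

def equivalent_transform(x, w, scales):
    """Divide input channel j of x by scales[j]; multiply row j of w by it.

    In exact arithmetic the layer output is unchanged:
    matmul(x / s, s * w) == matmul(x, w).
    """
    x_t = [[v / s for v, s in zip(row, scales)] for row in x]
    w_t = [[v * s for v in row] for row, s in zip(w, scales)]
    return x_t, w_t

def fake_quantize(values, max_abs, bits=8):
    """Symmetric uniform quantization with a clipping range of +/- max_abs.

    Values are rounded to a signed integer grid and mapped back to floats,
    which simulates integer quantization error in floating point.
    """
    qmax = 2 ** (bits - 1) - 1
    step = max_abs / qmax
    return [[max(-qmax, min(qmax, round(v / step))) * step for v in row]
            for row in values]

def max_abs_error(a, b):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

# Toy activations with one outlier channel (the second column).
x = [[0.5, 40.0], [-0.25, -32.0]]
w = [[1.0, -2.0], [0.5, 0.25]]
y_ref = matmul(x, w)

# Hand-picked per-channel scales; a learned method would optimize these.
scales = [1.0, 16.0]
x_t, w_t = equivalent_transform(x, w, scales)

# Quantize activations to 8 bits, using the largest magnitude as the range.
err_plain = max_abs_error(matmul(fake_quantize(x, 40.0), w), y_ref)
err_transformed = max_abs_error(matmul(fake_quantize(x_t, 2.5), w_t), y_ref)
```

In this toy case the transformation moves the outlier magnitude from the activations into the weights, so the 8-bit activation grid becomes finer and `err_transformed` comes out smaller than `err_plain`. The weights become harder to quantize in exchange, which is one reason to tune the scales and the clipping ranges together rather than fixing them by hand as done here.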

