
MobileQuant: Mobile-friendly Quantization for On-device Language Models

August 25, 2024
Authors: Fuwen Tan, Royson Lee, Łukasz Dudziak, Shell Xu Hu, Sourav Bhattacharya, Timothy Hospedales, Georgios Tzimiropoulos, Brais Martinez
cs.AI

Abstract

Large language models (LLMs) have revolutionized language processing, delivering outstanding results across multiple applications. However, deploying LLMs on edge devices poses several challenges with respect to memory, energy, and compute costs, limiting their widespread use in devices such as mobile phones. A promising solution is to reduce the number of bits used to represent weights and activations. While existing works have found partial success at quantizing LLMs to lower bitwidths, e.g. 4-bit weights, quantizing activations beyond 16 bits often leads to large computational overheads due to poor on-device quantization support, or a considerable accuracy drop. Yet, 8-bit activations are very attractive for on-device deployment as they would enable LLMs to fully exploit mobile-friendly hardware, e.g. Neural Processing Units (NPUs). In this work, we make a first attempt to facilitate the on-device deployment of LLMs using integer-only quantization. We first investigate the limitations of existing quantization methods for on-device deployment, with a special focus on activation quantization. We then address these limitations by introducing a simple post-training quantization method, named MobileQuant, that extends previous weight equivalent transformation works by jointly optimizing the weight transformation and activation range parameters in an end-to-end manner. MobileQuant demonstrates superior capabilities over existing methods by 1) achieving near-lossless quantization on a wide range of LLM benchmarks, 2) reducing latency and energy consumption by 20%-50% compared to current on-device quantization strategies, 3) requiring limited compute budget, and 4) being compatible with mobile-friendly compute units, e.g. NPUs.
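The core idea described in the abstract is to keep the weight-equivalent transformation used by prior work (scale each activation channel down and the matching weight column up, which leaves the layer's full-precision output unchanged), but to learn the per-channel scales jointly with the activation clipping ranges by minimizing the output error of the fake-quantized layer. The sketch below, in PyTorch, illustrates this on a single linear layer. It is an illustrative reconstruction rather than the authors' implementation: the tensor shapes, learning rate, step count, per-tensor weight quantization, and all variable names are assumptions.

import torch

def ste_round(x):
    # Round with a straight-through estimator so gradients pass unchanged.
    return x + (torch.round(x) - x).detach()

def fake_quant(x, lo, hi, n_bits=8):
    # Uniform asymmetric fake quantization of x into [lo, hi].
    # Gradients flow to lo/hi through `scale`, enabling range learning.
    scale = (hi - lo) / (2 ** n_bits - 1)
    q = torch.clamp(ste_round((x - lo) / scale), 0, 2 ** n_bits - 1)
    return q * scale + lo

torch.manual_seed(0)
d_in, d_out, n = 64, 64, 256                        # illustrative sizes
W = torch.randn(d_out, d_in)                        # frozen full-precision weight
X = torch.randn(n, d_in) * (torch.rand(d_in) * 4)   # activations with uneven per-channel ranges
Y_fp = X @ W.t()                                    # full-precision reference output

# Learnable parameters: per-channel equivalent-transformation scales s
# (x -> x / s, W -> W * s keeps x @ W.t() unchanged in full precision)
# and the activation clipping range [a_lo, a_hi].
log_s = torch.zeros(d_in, requires_grad=True)
a_lo = torch.tensor(float(X.min()), requires_grad=True)
a_hi = torch.tensor(float(X.max()), requires_grad=True)

opt = torch.optim.Adam([log_s, a_lo, a_hi], lr=1e-2)
for step in range(200):
    s = log_s.exp()
    X_t, W_t = X / s, W * s                         # equivalent transformation
    X_q = fake_quant(X_t, a_lo, a_hi)               # 8-bit activations
    W_q = fake_quant(W_t, W_t.min(), W_t.max())     # 8-bit weights (per-tensor here)
    loss = ((X_q @ W_q.t() - Y_fp) ** 2).mean()     # end-to-end output matching
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"output MSE after joint optimization: {loss.item():.6f}")

Because the transformation scales and the clipping ranges receive gradients from the same output-matching loss, outlier channels can be traded off against clipping error in a single optimization, which is what makes 8-bit, integer-only activations viable without the accuracy drop that fixed calibration typically incurs.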
