MobileQuant: Mobile-friendly Quantization for On-device Language Models
August 25, 2024
Authors: Fuwen Tan, Royson Lee, Łukasz Dudziak, Shell Xu Hu, Sourav Bhattacharya, Timothy Hospedales, Georgios Tzimiropoulos, Brais Martinez
cs.AI
Abstract
Large language models (LLMs) have revolutionized language processing, delivering outstanding results across multiple applications. However, deploying LLMs on edge devices poses several challenges with respect to memory, energy, and compute costs, limiting their widespread use in devices such as mobile phones. A promising solution is to reduce the number of bits used to represent weights and activations. While existing works have found partial success at quantizing LLMs to lower bitwidths, e.g. 4-bit weights, quantizing activations beyond 16 bits often leads to large computational overheads due to poor on-device quantization support, or to a considerable accuracy drop. Yet, 8-bit activations are very attractive for on-device deployment, as they would enable LLMs to fully exploit mobile-friendly hardware such as Neural Processing Units (NPUs). In this work, we make a first attempt to facilitate the on-device deployment of LLMs using integer-only quantization. We first investigate the limitations of existing quantization methods for on-device deployment, with a special focus on activation quantization. We then address these limitations by introducing a simple post-training quantization method, named MobileQuant, that extends previous weight equivalent transformation works by jointly optimizing the weight transformation and activation range parameters in an end-to-end manner. MobileQuant demonstrates superior capabilities over existing methods by 1) achieving near-lossless quantization on a wide range of LLM benchmarks, 2) reducing latency and energy consumption by 20%-50% compared to current on-device quantization strategies, 3) requiring a limited compute budget, and 4) being compatible with mobile-friendly compute units, e.g. NPUs.
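To make the abstract's core idea concrete, the following is a minimal sketch, not the authors' released implementation, of what "extending a weight equivalent transformation by jointly optimizing the weight transformation and activation range parameters end-to-end" could look like for a single linear layer: a SmoothQuant/AWQ-style per-channel scaling (x / s, W * s leaves the product unchanged) is learned together with a per-tensor activation clipping range, against the full-precision output, under simulated 8-bit integer quantization. All names (fake_quant, EquivalentTransformLinear, the calibration loop and its hyperparameters) are illustrative assumptions.

```python
import torch
import torch.nn as nn


def fake_quant(x, scale, zero_point, qmin, qmax):
    """Simulated integer quantization with a straight-through estimator for rounding."""
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    deq = (q - zero_point) * scale
    return x + (deq - x).detach()  # gradient passes straight through the rounding


class EquivalentTransformLinear(nn.Module):
    """Wraps a frozen nn.Linear with a learnable equivalent transform and activation range."""

    def __init__(self, linear: nn.Linear, w_bits: int = 8, a_bits: int = 8):
        super().__init__()
        self.linear = linear.eval()
        for p in self.linear.parameters():
            p.requires_grad_(False)
        # Per-input-channel scaling s: x / s and W * s leave the matmul output unchanged.
        self.log_s = nn.Parameter(torch.zeros(linear.in_features))
        # Learnable per-tensor activation clipping range (static ranges are NPU-friendly).
        self.a_max = nn.Parameter(torch.tensor(4.0))
        self.w_qmax = 2 ** (w_bits - 1) - 1
        self.a_qmax = 2 ** a_bits - 1

    def forward(self, x):
        s = self.log_s.exp()
        # Transformed weight, quantized symmetrically per output channel.
        w = self.linear.weight * s                      # [out, in] * [in]
        w_scale = w.abs().amax(dim=1, keepdim=True) / self.w_qmax
        w_q = fake_quant(w, w_scale, 0, -self.w_qmax - 1, self.w_qmax)
        # Transformed activation, quantized asymmetrically per tensor with a learned range.
        a_max = self.a_max.clamp_min(1e-3)
        x_t = torch.clamp(x / s, -a_max, a_max)
        a_scale = 2 * a_max / self.a_qmax
        a_zp = self.a_qmax // 2
        x_q = fake_quant(x_t, a_scale, a_zp, 0, self.a_qmax)
        return nn.functional.linear(x_q, w_q, self.linear.bias)


# Toy calibration: fit the scaling and the activation range so the quantized layer
# mimics the full-precision one on inputs with deliberately skewed channel magnitudes.
torch.manual_seed(0)
fp_layer = nn.Linear(512, 512)
q_layer = EquivalentTransformLinear(fp_layer)
optimizer = torch.optim.Adam([q_layer.log_s, q_layer.a_max], lr=1e-2)

calib = torch.randn(256, 512) * torch.linspace(0.1, 5.0, 512)  # outlier-heavy channels
for step in range(200):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(q_layer(calib), fp_layer(calib))
    loss.backward()
    optimizer.step()
print(f"calibration MSE vs. full precision: {loss.item():.6f}")
```

In this reading, the statically learned per-tensor activation range is what keeps the layer expressible as a fixed-point, integer-only kernel on NPUs, while the learned per-channel scaling moves outlier channels from the activations into the weights, where per-channel weight quantization absorbs them with less error.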