ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization
June 10, 2024
Authors: Haoran You, Yipin Guo, Yichao Fu, Wei Zhou, Huihong Shi, Xiaofan Zhang, Souvik Kundu, Amir Yazdanbakhsh, Yingyan Lin
cs.AI
Abstract
Large language models (LLMs) have shown impressive performance on language
tasks but face challenges when deployed on resource-constrained devices due to
their extensive parameters and reliance on dense multiplications, resulting in
high memory demands and latency bottlenecks. Shift-and-add reparameterization
offers a promising solution by replacing costly multiplications with
hardware-friendly primitives in both the attention and multi-layer perceptron
(MLP) layers of an LLM. However, current reparameterization techniques require
training from scratch or full parameter fine-tuning to restore accuracy, which
is resource-intensive for LLMs. To address this, we propose accelerating
pretrained LLMs through post-training shift-and-add reparameterization,
creating efficient multiplication-free models, dubbed ShiftAddLLM.
Specifically, we quantize each weight matrix into binary matrices paired with
group-wise scaling factors. The associated multiplications are reparameterized
into (1) shifts between activations and scaling factors and (2) queries and
adds according to the binary matrices. To reduce accuracy loss, we present a
multi-objective optimization method to minimize both weight and output
activation reparameterization errors. Additionally, based on varying
sensitivity across layers to reparameterization, we develop an automated bit
allocation strategy to further reduce memory usage and latency. Experiments on
five LLM families and eight tasks consistently validate the effectiveness of
ShiftAddLLM, achieving average perplexity improvements of 5.6 and 22.7 points
at comparable or lower latency compared to the most competitive quantized LLMs
at 3 and 2 bits, respectively, and more than 80% memory and energy reductions
over the original LLMs. Code and models are available at
https://github.com/GATECH-EIC/ShiftAddLLM.
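To make the reparameterization concrete, below is a minimal NumPy sketch of the idea stated in the abstract: each weight matrix is approximated as a sum of binary matrices paired with group-wise scaling factors constrained to powers of two, so a matrix-vector product reduces to sign flips and additions plus bit shifts. The function names (`quantize_bcq`, `shiftadd_matvec`), the greedy residual fitting, and the group layout are illustrative assumptions for exposition; they are not the paper's multi-objective optimization, automated bit allocation, or LUT-based GPU kernels.

```python
import numpy as np

def quantize_bcq(W, num_bits=3, group_size=8):
    """Toy binary-coding quantization (illustrative, not the paper's solver).

    Approximates each column group of W as sum_k alpha_k * B_k, where
    B_k has entries in {-1, +1} and alpha_k is rounded to a power of two
    so that scaling an activation by alpha_k becomes a bit shift.
    Assumes W.shape[1] is divisible by group_size.
    """
    out_dim, in_dim = W.shape
    assert in_dim % group_size == 0
    num_groups = in_dim // group_size
    alphas = np.zeros((out_dim, num_groups, num_bits))
    Bs = np.zeros((out_dim, in_dim, num_bits))
    for g in range(num_groups):
        residual = W[:, g * group_size:(g + 1) * group_size].copy()
        for k in range(num_bits):
            B = np.sign(residual)
            B[B == 0] = 1.0                       # avoid zero entries in the binary matrix
            alpha = np.abs(residual).mean(axis=1)  # greedy per-row scale for this group
            alpha = 2.0 ** np.round(np.log2(np.maximum(alpha, 1e-12)))  # snap to power of two
            Bs[:, g * group_size:(g + 1) * group_size, k] = B
            alphas[:, g, k] = alpha
            residual -= alpha[:, None] * B         # fit the next binary matrix to the residual
    return alphas, Bs

def shiftadd_matvec(x, alphas, Bs, group_size=8):
    """Multiplication-free evaluation of y ~= W @ x using the BCQ factors."""
    out_dim, in_dim, num_bits = Bs.shape
    y = np.zeros(out_dim)
    for g in range(in_dim // group_size):
        xg = x[g * group_size:(g + 1) * group_size]
        for k in range(num_bits):
            # B @ xg involves only additions/subtractions of activations.
            partial = Bs[:, g * group_size:(g + 1) * group_size, k] @ xg
            # alpha is a power of two, so this scaling is a bit shift on hardware;
            # the float multiply here is only for readability in the sketch.
            y += alphas[:, g, k] * partial
    return y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((16, 32))
    x = rng.standard_normal(32)
    alphas, Bs = quantize_bcq(W, num_bits=3, group_size=8)
    print("exact:          ", (W @ x)[:4])
    print("shift-add approx:", shiftadd_matvec(x, alphas, Bs, group_size=8)[:4])
```

The sketch only illustrates why the reparameterized product needs no dense multiplications; the paper additionally optimizes the scaling factors and binary matrices against both weight and output-activation errors and assigns per-layer bit widths automatically.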