ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization
June 10, 2024
Authors: Haoran You, Yipin Guo, Yichao Fu, Wei Zhou, Huihong Shi, Xiaofan Zhang, Souvik Kundu, Amir Yazdanbakhsh, Yingyan Lin
cs.AI
Abstract
Large language models (LLMs) have shown impressive performance on language
tasks but face challenges when deployed on resource-constrained devices due to
their extensive parameters and reliance on dense multiplications, resulting in
high memory demands and latency bottlenecks. Shift-and-add reparameterization
offers a promising solution by replacing costly multiplications with
hardware-friendly primitives in both the attention and multi-layer perceptron
(MLP) layers of an LLM. However, current reparameterization techniques require
training from scratch or full parameter fine-tuning to restore accuracy, which
is resource-intensive for LLMs. To address this, we propose accelerating
pretrained LLMs through post-training shift-and-add reparameterization,
creating efficient multiplication-free models, dubbed ShiftAddLLM.
Specifically, we quantize each weight matrix into binary matrices paired with
group-wise scaling factors. The associated multiplications are reparameterized
into (1) shifts between activations and scaling factors and (2) queries and
adds according to the binary matrices. To reduce accuracy loss, we present a
multi-objective optimization method to minimize both weight and output
activation reparameterization errors. Additionally, based on varying
sensitivity across layers to reparameterization, we develop an automated bit
allocation strategy to further reduce memory usage and latency. Experiments on
five LLM families and eight tasks consistently validate the effectiveness of
ShiftAddLLM, achieving average perplexity improvements of 5.6 and 22.7 points
at comparable or lower latency compared to the most competitive quantized LLMs
at 3 and 2 bits, respectively, and more than 80% memory and energy reductions
over the original LLMs. Code and models are available at
https://github.com/GATECH-EIC/ShiftAddLLM.
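To make the reparameterization concrete, below is a minimal NumPy sketch of the idea stated in the abstract: each weight matrix is approximated as a sum of binary matrices paired with group-wise scaling factors constrained to powers of two, so a matrix-vector product reduces to sign flips and additions plus bit shifts. The function names (`quantize_bcq`, `shiftadd_matvec`), the greedy residual fitting, and the group layout are illustrative assumptions for exposition; they are not the paper's multi-objective optimization, automated bit allocation, or LUT-based GPU kernels.

```python
import numpy as np

def quantize_bcq(W, num_bits=3, group_size=8):
    """Toy binary-coding quantization (illustrative, not the paper's solver).

    Approximates each column group of W as sum_k alpha_k * B_k, where
    B_k has entries in {-1, +1} and alpha_k is rounded to a power of two
    so that scaling an activation by alpha_k becomes a bit shift.
    Assumes W.shape[1] is divisible by group_size.
    """
    out_dim, in_dim = W.shape
    assert in_dim % group_size == 0
    num_groups = in_dim // group_size
    alphas = np.zeros((out_dim, num_groups, num_bits))
    Bs = np.zeros((out_dim, in_dim, num_bits))
    for g in range(num_groups):
        residual = W[:, g * group_size:(g + 1) * group_size].copy()
        for k in range(num_bits):
            B = np.sign(residual)
            B[B == 0] = 1.0                       # avoid zero entries in the binary matrix
            alpha = np.abs(residual).mean(axis=1)  # greedy per-row scale for this group
            alpha = 2.0 ** np.round(np.log2(np.maximum(alpha, 1e-12)))  # snap to power of two
            Bs[:, g * group_size:(g + 1) * group_size, k] = B
            alphas[:, g, k] = alpha
            residual -= alpha[:, None] * B         # fit the next binary matrix to the residual
    return alphas, Bs

def shiftadd_matvec(x, alphas, Bs, group_size=8):
    """Multiplication-free evaluation of y ~= W @ x using the BCQ factors."""
    out_dim, in_dim, num_bits = Bs.shape
    y = np.zeros(out_dim)
    for g in range(in_dim // group_size):
        xg = x[g * group_size:(g + 1) * group_size]
        for k in range(num_bits):
            # B @ xg involves only additions/subtractions of activations.
            partial = Bs[:, g * group_size:(g + 1) * group_size, k] @ xg
            # alpha is a power of two, so this scaling is a bit shift on hardware;
            # the float multiply here is only for readability in the sketch.
            y += alphas[:, g, k] * partial
    return y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((16, 32))
    x = rng.standard_normal(32)
    alphas, Bs = quantize_bcq(W, num_bits=3, group_size=8)
    print("exact:          ", (W @ x)[:4])
    print("shift-add approx:", shiftadd_matvec(x, alphas, Bs, group_size=8)[:4])
```

The sketch only illustrates why the reparameterized product needs no dense multiplications; the paper additionally optimizes the scaling factors and binary matrices against both weight and output-activation errors and assigns per-layer bit widths automatically.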