ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization
June 10, 2024
Authors: Haoran You, Yipin Guo, Yichao Fu, Wei Zhou, Huihong Shi, Xiaofan Zhang, Souvik Kundu, Amir Yazdanbakhsh, Yingyan Lin
cs.AI
Abstract
Large language models (LLMs) have shown impressive performance on language
tasks but face challenges when deployed on resource-constrained devices due to
their extensive parameters and reliance on dense multiplications, resulting in
high memory demands and latency bottlenecks. Shift-and-add reparameterization
offers a promising solution by replacing costly multiplications with
hardware-friendly primitives in both the attention and multi-layer perceptron
(MLP) layers of an LLM. However, current reparameterization techniques require
training from scratch or full parameter fine-tuning to restore accuracy, which
is resource-intensive for LLMs. To address this, we propose accelerating
pretrained LLMs through post-training shift-and-add reparameterization,
creating efficient multiplication-free models, dubbed ShiftAddLLM.
Specifically, we quantize each weight matrix into binary matrices paired with
group-wise scaling factors. The associated multiplications are reparameterized
into (1) shifts between activations and scaling factors and (2) queries and
adds according to the binary matrices. To reduce accuracy loss, we present a
multi-objective optimization method to minimize both weight and output
activation reparameterization errors. Additionally, based on varying
sensitivity across layers to reparameterization, we develop an automated bit
allocation strategy to further reduce memory usage and latency. Experiments on
five LLM families and eight tasks consistently validate the effectiveness of
ShiftAddLLM, achieving average perplexity improvements of 5.6 and 22.7 points
at comparable or lower latency compared to the most competitive quantized LLMs
at 3 and 2 bits, respectively, and more than 80% memory and energy reductions
over the original LLMs. Codes and models are available at
https://github.com/GATECH-EIC/ShiftAddLLM.
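As a rough illustration of the reparameterization the abstract describes, the NumPy sketch below decomposes a weight matrix into binary matrices paired with group-wise scaling factors and evaluates a matrix-vector product using only power-of-two scalings of activations (shifts) and signed accumulations (adds). The greedy residual binarization, the scalar per-group scales, and the names `quantize_group`, `shiftadd_matvec`, `group_size`, and `n_bits` are assumptions made for this sketch; they are not taken from the paper or its released code.

```python
import numpy as np

def quantize_group(Wg, n_bits=3):
    """Greedy residual binarization of one weight group:
    Wg ~= sum_i alpha_i * B_i, with each alpha_i rounded to a power of two
    so that scaling an activation by alpha_i amounts to a bit shift.
    Illustrative only; not the paper's optimization procedure."""
    residual = Wg.copy()
    alphas, binaries = [], []
    for _ in range(n_bits):
        alpha = np.abs(residual).mean()
        # Snap the scale to the nearest power of two (shift-friendly).
        alpha = 2.0 ** np.round(np.log2(alpha + 1e-12))
        B = np.where(residual >= 0, 1.0, -1.0)   # binary matrix in {-1, +1}
        alphas.append(alpha)
        binaries.append(B)
        residual = residual - alpha * B
    return alphas, binaries

def shiftadd_matvec(x, W, n_bits=3, group_size=64):
    """Approximate y = x @ W using only (1) power-of-two scalings of activation
    slices ("shifts") and (2) signed accumulations driven by binary matrices
    ("adds"). No dense floating-point multiplication with W is performed."""
    in_dim, out_dim = W.shape
    y = np.zeros(out_dim)
    for start in range(0, in_dim, group_size):
        x_slice = x[start:start + group_size]
        alphas, binaries = quantize_group(W[start:start + group_size, :], n_bits)
        for alpha, B in zip(alphas, binaries):
            # x_slice * alpha corresponds to a bit shift in fixed-point hardware
            # (alpha is 2^k); B is in {-1, +1}, so the product is pure add/subtract.
            y += (x_slice * alpha) @ B
    return y

# Sanity check of the approximation on random data.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(256, 128))
x = rng.normal(size=256)
y_ref, y_approx = x @ W, shiftadd_matvec(x, W)
print("relative error:", np.linalg.norm(y_ref - y_approx) / np.linalg.norm(y_ref))
```

Note that the abstract describes the binary-matrix step as queries and adds, i.e. lookups over precomputed partial sums of activations; the sketch simplifies that step to a direct signed accumulation, and it omits the paper's multi-objective error minimization and automated bit allocation entirely.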