ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization
June 10, 2024
Authors: Haoran You, Yipin Guo, Yichao Fu, Wei Zhou, Huihong Shi, Xiaofan Zhang, Souvik Kundu, Amir Yazdanbakhsh, Yingyan Lin
cs.AI
Abstract
Large language models (LLMs) have shown impressive performance on language
tasks but face challenges when deployed on resource-constrained devices due to
their extensive parameters and reliance on dense multiplications, resulting in
high memory demands and latency bottlenecks. Shift-and-add reparameterization
offers a promising solution by replacing costly multiplications with
hardware-friendly primitives in both the attention and multi-layer perceptron
(MLP) layers of an LLM. However, current reparameterization techniques require
training from scratch or full parameter fine-tuning to restore accuracy, which
is resource-intensive for LLMs. To address this, we propose accelerating
pretrained LLMs through post-training shift-and-add reparameterization,
creating efficient multiplication-free models, dubbed ShiftAddLLM.
Specifically, we quantize each weight matrix into binary matrices paired with
group-wise scaling factors. The associated multiplications are reparameterized
into (1) shifts between activations and scaling factors and (2) queries and
adds according to the binary matrices. To reduce accuracy loss, we present a
multi-objective optimization method to minimize both weight and output
activation reparameterization errors. Additionally, based on varying
sensitivity across layers to reparameterization, we develop an automated bit
allocation strategy to further reduce memory usage and latency. Experiments on
five LLM families and eight tasks consistently validate the effectiveness of
ShiftAddLLM, achieving average perplexity improvements of 5.6 and 22.7 points
at comparable or lower latency compared to the most competitive quantized LLMs
at 3 and 2 bits, respectively, and more than 80% memory and energy reductions
over the original LLMs. Codes and models are available at
https://github.com/GATECH-EIC/ShiftAddLLM.
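As a rough illustration of the reparameterization the abstract describes, the NumPy sketch below decomposes a weight matrix into binary matrices paired with group-wise scaling factors and evaluates a matrix-vector product using only power-of-two scalings of activations (shifts) and signed accumulations (adds). The greedy residual binarization, the scalar per-group scales, and the names `quantize_group`, `shiftadd_matvec`, `group_size`, and `n_bits` are assumptions made for this sketch; they are not taken from the paper or its released code.

```python
import numpy as np

def quantize_group(Wg, n_bits=3):
    """Greedy residual binarization of one weight group:
    Wg ~= sum_i alpha_i * B_i, with each alpha_i rounded to a power of two
    so that scaling an activation by alpha_i amounts to a bit shift.
    Illustrative only; not the paper's optimization procedure."""
    residual = Wg.copy()
    alphas, binaries = [], []
    for _ in range(n_bits):
        alpha = np.abs(residual).mean()
        # Snap the scale to the nearest power of two (shift-friendly).
        alpha = 2.0 ** np.round(np.log2(alpha + 1e-12))
        B = np.where(residual >= 0, 1.0, -1.0)   # binary matrix in {-1, +1}
        alphas.append(alpha)
        binaries.append(B)
        residual = residual - alpha * B
    return alphas, binaries

def shiftadd_matvec(x, W, n_bits=3, group_size=64):
    """Approximate y = x @ W using only (1) power-of-two scalings of activation
    slices ("shifts") and (2) signed accumulations driven by binary matrices
    ("adds"). No dense floating-point multiplication with W is performed."""
    in_dim, out_dim = W.shape
    y = np.zeros(out_dim)
    for start in range(0, in_dim, group_size):
        x_slice = x[start:start + group_size]
        alphas, binaries = quantize_group(W[start:start + group_size, :], n_bits)
        for alpha, B in zip(alphas, binaries):
            # x_slice * alpha corresponds to a bit shift in fixed-point hardware
            # (alpha is 2^k); B is in {-1, +1}, so the product is pure add/subtract.
            y += (x_slice * alpha) @ B
    return y

# Sanity check of the approximation on random data.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(256, 128))
x = rng.normal(size=256)
y_ref, y_approx = x @ W, shiftadd_matvec(x, W)
print("relative error:", np.linalg.norm(y_ref - y_approx) / np.linalg.norm(y_ref))
```

Note that the abstract describes the binary-matrix step as queries and adds, i.e. lookups over precomputed partial sums of activations; the sketch simplifies that step to a direct signed accumulation, and it omits the paper's multi-objective error minimization and automated bit allocation entirely.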