SliceGPT: Compress Large Language Models by Deleting Rows and Columns
January 26, 2024
Authors: Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman
cs.AI
Abstract
Large language models have become the cornerstone of natural language
processing, but their use comes with substantial costs in terms of compute and
memory resources. Sparsification provides a solution to alleviate these
resource constraints, and recent works have shown that trained models can be
sparsified post-hoc. Existing sparsification techniques face challenges as they
need additional data structures and offer constrained speedup with current
hardware. In this paper we present SliceGPT, a new post-training sparsification
scheme which replaces each weight matrix with a smaller (dense) matrix,
reducing the embedding dimension of the network. Through extensive
experimentation, we show that SliceGPT can remove up to 25% of the model
parameters (including embeddings) for LLAMA2-70B, OPT 66B and Phi-2 models
while maintaining 99%, 99% and 90% zero-shot task performance of the dense
model respectively. Our sliced models run on fewer GPUs and run faster without
any additional code optimization: on 24GB consumer GPUs we reduce the total
compute for inference on LLAMA2-70B to 64% of that of the dense model; on 40GB
A100 GPUs we reduce it to 66%. We offer a new insight, computational invariance
in transformer networks, which enables SliceGPT and we hope it will inspire and
enable future avenues to reduce memory and computation demands for pre-trained
models. Code is available at:
https://github.com/microsoft/TransformerCompression
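The abstract names two ideas: computational invariance (a transformer's output is unchanged if activations and weights are rotated by the same orthogonal matrix) and slicing (dropping the least-important directions so each weight matrix becomes a smaller dense matrix). The snippet below is a minimal PyTorch sketch of those two ideas on a single toy weight matrix; the tensor shapes, variable names, and the PCA-via-SVD step are illustrative assumptions for this sketch, not the implementation in the linked repository.

```python
import torch

torch.manual_seed(0)

d, d_small, n = 8, 6, 128                 # toy embedding width, sliced width, calibration samples
W = torch.randn(4, d)                      # a toy weight matrix acting on d-dimensional activations
# Toy calibration activations with most of their energy in d_small directions
X = torch.randn(n, d_small) @ torch.randn(d_small, d) + 0.05 * torch.randn(n, d)

# PCA of the calibration activations (via SVD) gives an orthogonal basis Q.
# Rotating activations and weights by the same Q leaves the product unchanged:
#   (X @ Q) @ (W @ Q).T == X @ W.T   (computational invariance, up to float error)
_, _, Vh = torch.linalg.svd(X, full_matrices=False)
Q = Vh.T                                   # d x d orthogonal; columns sorted by importance

X_rot = X @ Q                              # activations in the rotated basis
W_rot = W @ Q                              # weights in the rotated basis
assert torch.allclose(X_rot @ W_rot.T, X @ W.T, atol=1e-3)

# "Slicing": delete the least-important directions, i.e. drop the trailing
# columns of both the rotated activations and the rotated weights. The result
# is a smaller but still dense weight matrix.
X_sliced = X_rot[:, :d_small]              # n x d_small
W_sliced = W_rot[:, :d_small]              # 4 x d_small

approx = X_sliced @ W_sliced.T             # approximates the original X @ W.T
err = (approx - X @ W.T).norm() / (X @ W.T).norm()
print(f"relative error after slicing {d} -> {d_small} dims: {err:.3f}")
```

Because the sliced weights stay dense, the smaller matrix multiplications run at full speed on standard GPU kernels, which is the reason the paper reports speedups without any additional code optimization.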