

SliceGPT: Compress Large Language Models by Deleting Rows and Columns

January 26, 2024
Authors: Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman
cs.AI

Abstract

Large language models have become the cornerstone of natural language processing, but their use comes with substantial costs in terms of compute and memory resources. Sparsification provides a solution to alleviate these resource constraints, and recent works have shown that trained models can be sparsified post-hoc. Existing sparsification techniques face challenges as they need additional data structures and offer constrained speedup with current hardware. In this paper we present SliceGPT, a new post-training sparsification scheme which replaces each weight matrix with a smaller (dense) matrix, reducing the embedding dimension of the network. Through extensive experimentation, we show that SliceGPT can remove up to 25% of the model parameters (including embeddings) for LLAMA2-70B, OPT 66B and Phi-2 models while maintaining 99%, 99% and 90% zero-shot task performance of the dense model respectively. Our sliced models run on fewer GPUs and run faster without any additional code optimization: on 24GB consumer GPUs we reduce the total compute for inference on LLAMA2-70B to 64% of that of the dense model; on 40GB A100 GPUs we reduce it to 66%. We offer a new insight, computational invariance in transformer networks, which enables SliceGPT and we hope it will inspire and enable future avenues to reduce memory and computation demands for pre-trained models. Code is available at: https://github.com/microsoft/TransformerCompression
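To make the core idea concrete, here is a minimal, illustrative sketch (not the authors' implementation; see the linked repository for that) of how a weight matrix can be rotated with an orthogonal basis and then sliced to a smaller dense matrix, reducing the embedding dimension. The names `slice_linear`, `Q`, and `keep` are hypothetical, and the random QR factor stands in for the basis that SliceGPT derives from activation statistics.

```python
import torch

def slice_linear(weight: torch.Tensor, Q: torch.Tensor, keep: int) -> torch.Tensor:
    """Rotate a linear layer's weight into an orthogonal basis Q and keep only
    the first `keep` input directions, producing a smaller dense matrix.
    weight: (out_features, in_features); Q: (in_features, in_features), orthogonal."""
    rotated = weight @ Q       # re-express the layer's inputs in the new basis
    return rotated[:, :keep]   # drop the trailing (least important) columns

# Toy usage: shrink a 4096-dimensional input projection to 75% of its width.
d, keep = 4096, 3072
W = torch.randn(1024, d)
Q, _ = torch.linalg.qr(torch.randn(d, d))  # stand-in for a PCA-style basis
W_sliced = slice_linear(W, Q, keep)        # shape (1024, 3072)
```

Because Q is orthogonal, feeding the layer rotated activations and using the rotated weight leaves the computation unchanged; slicing then removes the directions that contribute least, which is the computational-invariance insight the abstract refers to.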