
Shortened LLaMA: A Simple Depth Pruning for Large Language Models

February 5, 2024
作者: Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, Hyoung-Kyu Song
cs.AI

Abstract

Structured pruning of modern large language models (LLMs) has emerged as a way of decreasing their high computational needs. Width pruning reduces the size of projection weight matrices (e.g., by removing attention heads) while maintaining the number of layers. Depth pruning, in contrast, removes entire layers or blocks, while keeping the size of the remaining weights unchanged. Most current research focuses on either width-only or a blend of width and depth pruning, with little comparative analysis between the two units (width vs. depth) concerning their impact on LLM inference efficiency. In this work, we show that a simple depth pruning approach can compete with recent width pruning methods in terms of zero-shot task performance. Our pruning method boosts inference speeds, especially under memory-constrained conditions that require limited batch sizes for running LLMs, where width pruning is ineffective. We hope this work can help deploy LLMs on local and edge devices.
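To make the width-vs-depth distinction concrete, below is a minimal sketch of depth pruning on a LLaMA-style model loaded with Hugging Face Transformers: entire decoder blocks are removed while every remaining weight matrix keeps its original shape (whereas width pruning would shrink those matrices instead). The magnitude-based importance score and the block budget `k` are placeholder assumptions for illustration, not the paper's actual pruning criterion.

```python
# Sketch: depth pruning a LLaMA-style causal LM by dropping whole decoder blocks.
# Assumptions: a Hugging Face LlamaForCausalLM checkpoint; a toy importance score.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

def block_importance(block: torch.nn.Module) -> float:
    """Placeholder score: mean absolute weight of the block (not the paper's metric)."""
    with torch.no_grad():
        return torch.cat([p.abs().flatten() for p in block.parameters()]).mean().item()

k = 8  # assumed pruning budget: number of transformer blocks to remove
layers = model.model.layers
scores = [(i, block_importance(b)) for i, b in enumerate(layers)]

# Drop the k lowest-scoring blocks, then restore the original block order.
keep = sorted(sorted(scores, key=lambda s: s[1])[k:])
model.model.layers = torch.nn.ModuleList(layers[i] for i, _ in keep)
model.config.num_hidden_layers = len(model.model.layers)
```

Because each surviving block is untouched, the pruned model runs with the same per-layer kernels as the original, just fewer of them; this is what lets depth pruning cut latency even at batch size 1, where width pruning's thinner matrix multiplications yield little speedup.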