Shortened LLaMA: A Simple Depth Pruning for Large Language Models
February 5, 2024
Authors: Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, Hyoung-Kyu Song
cs.AI
Abstract
Structured pruning of modern large language models (LLMs) has emerged as a
way of decreasing their high computational needs. Width pruning reduces the
size of projection weight matrices (e.g., by removing attention heads) while
maintaining the number of layers. Depth pruning, in contrast, removes entire
layers or blocks, while keeping the size of the remaining weights unchanged.
Most current research focuses on either width-only or a blend of width and
depth pruning, with little comparative analysis between the two units (width
vs. depth) concerning their impact on LLM inference efficiency. In this work,
we show that a simple depth pruning approach can compete with recent width
pruning methods in terms of zero-shot task performance. Our pruning method
boosts inference speeds, especially under memory-constrained conditions that
require limited batch sizes for running LLMs, where width pruning is
ineffective. We hope this work can help deploy LLMs on local and edge devices.
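To make the distinction concrete, the sketch below shows what block-level depth pruning looks like structurally on a LLaMA-style model: whole Transformer decoder blocks are deleted while the remaining weight matrices keep their original width. This assumes the HuggingFace transformers API (LlamaForCausalLM exposing model.model.layers); the block indices are arbitrary placeholders for illustration, not the selection criterion used in the paper.

```python
# Minimal sketch of block-level depth pruning on a LLaMA-style model.
# Assumes HuggingFace transformers; block indices below are hypothetical
# placeholders, not the paper's actual block-selection criterion.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)

# Hypothetical set of decoder blocks chosen for removal.
blocks_to_remove = {21, 22, 23, 24, 25}

# Depth pruning: drop entire Transformer blocks; the weights of the
# remaining blocks are left at their original size (unlike width pruning,
# which would shrink projection matrices inside every block).
kept_layers = torch.nn.ModuleList(
    layer
    for i, layer in enumerate(model.model.layers)
    if i not in blocks_to_remove
)
model.model.layers = kept_layers
model.config.num_hidden_layers = len(kept_layers)
```

Because each removed block eliminates its attention and MLP computation entirely, the pruned model performs fewer sequential layer passes per token, which is where the latency gain at small batch sizes comes from.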