DLP: Dynamic Layerwise Pruning in Large Language Models

May 27, 2025
作者: Yuli Chen, Bo Cheng, Jiale Han, Yingying Zhang, Yingting Li, Shuhao Zhang
cs.AI

Abstract

Pruning has recently been widely adopted to reduce the parameter scale and improve the inference efficiency of Large Language Models (LLMs). Mainstream pruning techniques often rely on uniform layerwise pruning strategies, which can lead to severe performance degradation at high sparsity levels. Recognizing the varying contributions of different layers in LLMs, recent studies have shifted their focus toward non-uniform layerwise pruning. However, these approaches often rely on pre-defined values, which can result in suboptimal performance. To overcome these limitations, we propose a novel method called Dynamic Layerwise Pruning (DLP). This approach adaptively determines the relative importance of each layer by integrating model weights with input activation information, assigning pruning rates accordingly. Experimental results show that DLP effectively preserves model performance at high sparsity levels across multiple LLMs. Specifically, at 70% sparsity, DLP reduces the perplexity of LLaMA2-7B by 7.79 and improves the average accuracy by 2.7% compared to state-of-the-art methods. Moreover, DLP is compatible with various existing LLM compression techniques and can be seamlessly integrated into Parameter-Efficient Fine-Tuning (PEFT). We release the code at https://github.com/ironartisan/DLP to facilitate future research.
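
To make the layerwise idea concrete, below is a minimal, hypothetical sketch of non-uniform layerwise pruning in the spirit described above: per-layer importance is approximated with a Wanda-style |weight| × input-activation-norm score, and per-layer sparsity is shifted around the global target so that less important layers are pruned more aggressively. The scoring proxy, the allocation rule, and the helper names (`layer_importance`, `allocate_sparsity`, `strength`) are illustrative assumptions, not the paper's exact DLP formulation; see the released code for the authors' method.

```python
import torch


def layer_importance(weights, activation_norms):
    """Per-layer importance score: mean of |W| * ||X|| over all weight entries.
    (A Wanda-style proxy; DLP's exact importance metric is not given in the abstract.)"""
    return torch.stack([
        (w.abs() * x_norm).mean() for w, x_norm in zip(weights, activation_norms)
    ])


def allocate_sparsity(importance, target_sparsity=0.7, strength=0.1):
    """Assign per-layer pruning rates around the global target:
    less important layers are pruned more, more important layers less."""
    # Center the importance scores so the average sparsity stays near the target.
    centered = importance - importance.mean()
    scale = centered.abs().max().clamp(min=1e-8)
    sparsity = target_sparsity - strength * centered / scale
    return sparsity.clamp(0.0, 1.0)


def prune_layer(weight, sparsity):
    """Magnitude-prune a single weight matrix to (approximately) the given sparsity."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)


if __name__ == "__main__":
    torch.manual_seed(0)
    # Toy stand-ins for per-layer weight matrices and per-input-channel activation norms.
    weights = [torch.randn(64, 64) for _ in range(4)]
    act_norms = [torch.rand(64) for _ in range(4)]  # broadcast across output rows
    imp = layer_importance(weights, act_norms)
    rates = allocate_sparsity(imp, target_sparsity=0.7)
    pruned = [prune_layer(w, s.item()) for w, s in zip(weights, rates)]
    for i, (s, w) in enumerate(zip(rates, pruned)):
        print(f"layer {i}: assigned sparsity={s:.3f}, zeros={100 * (w == 0).float().mean():.1f}%")
```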