DLP: Dynamic Layerwise Pruning in Large Language Models

Abstract

Pruning has recently been widely adopted to reduce the parameter scale and improve the inference efficiency of Large Language Models (LLMs). Mainstream pruning techniques often rely on uniform layerwise pruning strategies, which can lead to severe performance degradation at high sparsity levels. Recognizing the varying contributions of different layers in LLMs, recent studies have shifted their focus toward non-uniform layerwise pruning. However, these approaches often rely on pre-defined values, which can result in suboptimal performance. To overcome these limitations, we propose a novel method called Dynamic Layerwise Pruning (DLP). This approach adaptively determines the relative importance of each layer by integrating model weights with input activation information, assigning pruning rates accordingly. Experimental results show that DLP effectively preserves model performance at high sparsity levels across multiple LLMs. Specifically, at 70% sparsity, DLP reduces the perplexity of LLaMA2-7B by 7.79 and improves the average accuracy by 2.7% compared to state-of-the-art methods. Moreover, DLP is compatible with various existing LLM compression techniques and can be seamlessly integrated into Parameter-Efficient Fine-Tuning (PEFT). We release the code at https://github.com/ironartisan/DLP to facilitate future research.
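
The abstract does not spell out how layer importance is turned into pruning rates, so the following is only a minimal sketch of the general idea of non-uniform layerwise allocation, not the DLP algorithm itself. It assumes a hypothetical per-layer score (the sum of |weight| times input-activation norm), equally sized layers, and a hypothetical `spread` parameter that bounds how far any layer's sparsity may deviate from the global target. The function names `layer_importance` and `allocate_sparsity` are illustrative and do not come from the released code.

```python
from typing import List, Sequence


def layer_importance(weights: Sequence[Sequence[float]],
                     act_norms: Sequence[float]) -> float:
    """Hypothetical layer score: sum over entries of |w_ij| * ||x_j||.

    `weights` is one weight matrix (rows = output units), and `act_norms[j]`
    is the norm of the j-th input feature over some calibration data.
    This scoring rule is an assumption made for illustration.
    """
    return sum(abs(w) * act_norms[j]
               for row in weights
               for j, w in enumerate(row))


def allocate_sparsity(scores: Sequence[float],
                      target: float,
                      spread: float = 0.1) -> List[float]:
    """Map importance scores to per-layer sparsity ratios.

    More important layers receive lower sparsity. Before clipping, the
    ratios average exactly `target`, which matches the global sparsity
    only if all layers have the same number of parameters (an assumption).
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All layers look equally important: fall back to uniform pruning.
        return [target] * len(scores)
    norm = [(s - lo) / (hi - lo) for s in scores]   # rescale to [0, 1]
    mean_norm = sum(norm) / len(norm)
    # Centre the offsets so they sum to zero, then shift the target.
    ratios = [target - 2.0 * spread * (n - mean_norm) for n in norm]
    return [min(max(r, 0.0), 1.0) for r in ratios]


# Toy example: three tiny "layers", each a 2x2 weight matrix with input norms.
layers = [
    ([[1.0, -2.0], [0.5, 0.5]], [1.0, 1.0]),
    ([[0.1, 0.2], [0.3, 0.4]], [1.0, 2.0]),
    ([[1.0, 1.0], [1.0, 1.0]], [0.5, 0.5]),
]
scores = [layer_importance(w, x) for w, x in layers]
ratios = allocate_sparsity(scores, target=0.7, spread=0.1)
for i, (s, r) in enumerate(zip(scores, ratios)):
    print(f"layer {i}: score={s:.3f}, sparsity={r:.3f}")
```

In this sketch the highest-scoring layer is pruned least and the lowest-scoring layer most, while the average stays at the 70% target. A uniform strategy corresponds to `spread=0`.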
