Shortened LLaMA: A Simple Depth Pruning for Large Language Models

Abstract

Structured pruning of modern large language models (LLMs) has emerged as a way of decreasing their high computational needs. Width pruning reduces the size of projection weight matrices (e.g., by removing attention heads) while maintaining the number of layers. Depth pruning, in contrast, removes entire layers or blocks while keeping the size of the remaining weights unchanged. Most current research focuses on either width-only pruning or a blend of width and depth pruning, with little comparative analysis of the two units (width vs. depth) in terms of their impact on LLM inference efficiency. In this work, we show that a simple depth pruning approach can compete with recent width pruning methods in terms of zero-shot task performance. Our pruning method boosts inference speeds, especially under memory-constrained conditions that require limited batch sizes for running LLMs, where width pruning is ineffective. We hope this work can help deploy LLMs on local and edge devices.
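The distinction between the two pruning units can be illustrated with a toy sketch. It is not the method of this work: the `Block` class, the helper names, the model dimensions, and the indices of the dropped blocks are all hypothetical, and the abstract does not say how blocks are selected for removal. The sketch only shows that the two approaches can remove the same number of attention parameters while differing in how many sequential blocks remain.

```python
from dataclasses import dataclass, replace
from typing import List, Sequence


@dataclass(frozen=True)
class Block:
    """Toy Transformer block described only by its shapes (hypothetical)."""
    d_model: int
    n_heads: int
    head_dim: int

    def attn_params(self) -> int:
        # Q, K, V and output projections, each of size d_model x (n_heads * head_dim).
        return 4 * self.d_model * self.n_heads * self.head_dim


def depth_prune(blocks: Sequence[Block], drop: Sequence[int]) -> List[Block]:
    """Remove whole blocks by index; surviving blocks keep their weight shapes."""
    drop_set = set(drop)
    return [b for i, b in enumerate(blocks) if i not in drop_set]


def width_prune(blocks: Sequence[Block], heads_to_keep: int) -> List[Block]:
    """Keep every block but shrink its attention projections to fewer heads."""
    return [replace(b, n_heads=heads_to_keep) for b in blocks]


def total_attn_params(blocks: Sequence[Block]) -> int:
    return sum(b.attn_params() for b in blocks)


# Assumed toy configuration: 8 identical blocks.
model = [Block(d_model=64, n_heads=8, head_dim=8) for _ in range(8)]
deep_pruned = depth_prune(model, drop=[5, 6])   # indices chosen arbitrarily
wide_pruned = width_prune(model, heads_to_keep=6)

# Number of blocks executed sequentially per forward pass.
print(len(model), len(deep_pruned), len(wide_pruned))  # 8 6 8
# Attention parameter counts: both pruned variants remove 25%.
print(total_attn_params(model),
      total_attn_params(deep_pruned),
      total_attn_params(wide_pruned))  # 131072 98304 98304
```

In this toy setting both variants end up with the same parameter count, but only the depth-pruned one runs fewer blocks in sequence, which is the property the abstract associates with faster inference at small batch sizes.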
