ShortGPT: Layers in Large Language Models are More Redundant Than You Expect

Abstract

As Large Language Models (LLMs) continue to advance in performance, their size has grown significantly, with current LLMs containing billions or even trillions of parameters. However, in this study, we discovered that many layers of LLMs exhibit high similarity, and some layers play a negligible role in network functionality. Based on this observation, we define a metric called Block Influence (BI) to gauge the significance of each layer in LLMs. We then propose a straightforward pruning approach: layer removal, in which we directly delete the redundant layers in LLMs based on their BI scores. Experiments demonstrate that our method, which we call ShortGPT, significantly outperforms previous state-of-the-art (SOTA) methods in model pruning. Moreover, ShortGPT is orthogonal to quantization-like methods, enabling further reduction in parameters and computation. The ability to achieve better results through simple layer removal, as opposed to more complex pruning techniques, suggests a high degree of redundancy in the model architecture.
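The layer-removal idea can be illustrated with a small sketch. The abstract does not give the formula for Block Influence, so the code below assumes a similarity-based score: one minus the average cosine similarity between the hidden states entering a layer and those leaving it. The function names (`block_influence`, `layers_to_remove`) and the toy hidden states are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def block_influence(inputs, outputs):
    """Assumed score: 1 - mean cosine similarity over token positions.

    A score near 0 means the layer barely changes the direction of its
    hidden states, which this sketch treats as a sign of redundancy.
    """
    sims = [cosine(x, y) for x, y in zip(inputs, outputs)]
    return 1.0 - sum(sims) / len(sims)


def layers_to_remove(hidden_states, n_remove):
    """Pick the n_remove layers with the lowest assumed influence score.

    hidden_states[i] holds the token vectors entering layer i, and
    hidden_states[i + 1] holds the token vectors leaving it.
    """
    scores = [
        block_influence(hidden_states[i], hidden_states[i + 1])
        for i in range(len(hidden_states) - 1)
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:n_remove]), scores


# Toy data (hypothetical): three layers, two tokens, two dimensions.
toy_states = [
    [[1.0, 0.0], [0.0, 1.0]],  # input to layer 0
    [[0.0, 1.0], [1.0, 0.0]],  # layer 0 rotates each vector by 90 degrees
    [[0.0, 2.0], [2.0, 0.0]],  # layer 1 only rescales, direction unchanged
    [[1.0, 1.0], [1.0, 1.0]],  # layer 2 changes direction moderately
]

removed, scores = layers_to_remove(toy_states, n_remove=1)
```

In this toy setup the middle layer only rescales its inputs, so it receives the lowest score and is the one selected for removal. With a real model, the hidden states would come from running calibration text through the network and recording the activations at each layer boundary.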
