The LLM Surgeon

Abstract

State-of-the-art language models are becoming increasingly large in an effort to achieve the highest performance on large corpora of available textual data. However, the sheer size of Transformer architectures makes it difficult to deploy models within computational, environmental, or device-specific constraints. We explore data-driven compression of existing pretrained models as an alternative to training smaller models from scratch. To do so, we scale Kronecker-factored curvature approximations of the target loss landscape to large language models. This lets us compute both the dynamic allocation of structures that can be removed and updates to the remaining weights that account for the removal. We provide a general framework for unstructured, semi-structured, and structured pruning, and improve upon weight updates to capture more correlations between weights while remaining computationally efficient. Experimentally, our method can prune rows and columns from a range of OPT models and Llamav2-7B by 20%-30% with a negligible loss in performance, and it achieves state-of-the-art results in unstructured and semi-structured pruning of large language models.
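
To make the idea of a Kronecker-factored curvature approximation concrete, the following is a minimal sketch for a single linear layer, not the authors' implementation. It assumes the curvature is approximated as F ≈ G ⊗ A, with A estimated from layer inputs and G from output gradients, and it applies the classical Optimal Brain Surgeon cost and update for removing one weight. The function names, the random data, and the `damping` term are all hypothetical choices made for illustration; the framework described in the abstract additionally handles groups of weights, structured removal, and correlations that this sketch ignores.

```python
import numpy as np


def kfac_factors(acts, grads, damping=1e-3):
    """Estimate Kronecker factors A ~ E[x x^T] and G ~ E[g g^T] from samples.

    acts:  (n, C) layer inputs; grads: (n, R) gradients w.r.t. layer outputs.
    The damping term is an assumption added here to keep both factors invertible.
    """
    n = acts.shape[0]
    A = acts.T @ acts / n + damping * np.eye(acts.shape[1])
    G = grads.T @ grads / n + damping * np.eye(grads.shape[1])
    return A, G


def prune_one_weight(W, A, G):
    """Remove the cheapest single weight of W (R x C) under F ~ G (x) A."""
    A_inv = np.linalg.inv(A)
    G_inv = np.linalg.inv(G)
    # Diagonal of F^-1 = G^-1 (x) A^-1, laid out in the shape of W.
    inv_diag = np.outer(np.diag(G_inv), np.diag(A_inv))
    # Optimal Brain Surgeon cost of removing each weight on its own.
    cost = 0.5 * W**2 / inv_diag
    r, c = np.unravel_index(np.argmin(cost), W.shape)
    # Column of F^-1 for weight (r, c), reshaped to W, is an outer product.
    delta = -(W[r, c] / inv_diag[r, c]) * np.outer(G_inv[:, r], A_inv[:, c])
    W_new = W + delta
    W_new[r, c] = 0.0  # clear floating-point residue at the pruned entry
    return W_new, (r, c), cost[r, c]


# Hypothetical toy data: 64 samples, 4 inputs, 3 outputs.
rng = np.random.default_rng(0)
acts = rng.standard_normal((64, 4))
grads = rng.standard_normal((64, 3))
W = rng.standard_normal((3, 4))

A, G = kfac_factors(acts, grads)
W_new, (r, c), removal_cost = prune_one_weight(W, A, G)
```

The point of the factorization is that the sketch only ever inverts the small factors A and G; the full curvature matrix over all R·C weights is never formed, which is what makes this kind of approach plausible at large-model scale.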
