EvoPress: Towards Optimal Dynamic Model Compression via Evolutionary Search

Abstract

The high computational costs of large language models (LLMs) have led to a flurry of research on LLM compression, via methods such as quantization, sparsification, or structured pruning. A new frontier in this area is given by dynamic, non-uniform compression methods, which adjust the compression levels (e.g., sparsity) per block or even per layer in order to minimize accuracy loss, while guaranteeing a global compression threshold. Yet, current methods rely on heuristics for identifying the "importance" of a given layer towards the loss, based on assumptions such as error monotonicity, i.e., that the end-to-end model compression error is proportional to the sum of layer-wise errors. In this paper, we revisit this area, and propose a new and general approach for dynamic compression that is provably optimal in a given input range. We begin from the motivating observation that, in general, error monotonicity does not hold for LLMs: compressed models with a lower sum of per-layer errors can perform worse than models with higher error sums. To address this, we propose a new general evolutionary framework for dynamic LLM compression called EvoPress, which has provable convergence, and low sample and evaluation complexity. We show that these theoretical guarantees lead to highly competitive practical performance for dynamic compression of Llama, Mistral and Phi models. Via EvoPress, we set new state-of-the-art results across all compression approaches: structural pruning (block/layer dropping), unstructured sparsity, as well as quantization with dynamic bitwidths. Our code is available at https://github.com/IST-DASLab/EvoPress.
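To make the idea of evolutionary search over per-layer compression levels concrete, the following is a minimal toy sketch and not the EvoPress implementation. Everything in it is an assumption made for illustration: the function names (`level_switch_mutation`, `evolutionary_search`, `toy_loss`), the budget-preserving mutation that raises one layer's level while lowering another's, the simple keep-the-best selection rule, and the synthetic loss. The loss includes a cross-layer interaction term so that it is not a plain sum of per-layer errors, which is the situation the abstract describes.

```python
import random


def level_switch_mutation(levels, num_levels, rng):
    """Raise one layer's level and lower another's, so the sum of levels
    (our stand-in for the global compression budget) stays unchanged."""
    child = list(levels)
    up_candidates = [i for i, v in enumerate(child) if v < num_levels - 1]
    down_candidates = [i for i, v in enumerate(child) if v > 0]
    if not up_candidates or not down_candidates:
        return child
    up = rng.choice(up_candidates)
    down_options = [i for i in down_candidates if i != up]
    if not down_options:
        return child
    down = rng.choice(down_options)
    child[up] += 1
    child[down] -= 1
    return child


def evolutionary_search(initial_levels, num_levels, fitness,
                        generations=200, offspring=8, seed=0):
    """Keep a single parent; each generation, sample mutated children and
    accept the best child if it is no worse than the parent."""
    rng = random.Random(seed)
    parent = list(initial_levels)
    parent_score = fitness(parent)
    for _ in range(generations):
        children = [level_switch_mutation(parent, num_levels, rng)
                    for _ in range(offspring)]
        best = min(children, key=fitness)
        best_score = fitness(best)
        if best_score <= parent_score:
            parent, parent_score = best, best_score
    return parent, parent_score


def toy_loss(levels):
    """Synthetic end-to-end error (hypothetical): per-layer terms plus an
    interaction between layers 1 and 2, so it is not purely additive."""
    sensitivity = [5.0, 1.0, 0.5, 3.0, 0.2, 2.0]
    per_layer = sum(s * v * v for s, v in zip(sensitivity, levels))
    interaction = 2.0 * levels[1] * levels[2]
    return per_layer + interaction


# Start from a uniform assignment: every layer at compression level 2 of 0..4.
start = [2, 2, 2, 2, 2, 2]
best_levels, best_loss = evolutionary_search(start, num_levels=5, fitness=toy_loss)
print(best_levels, best_loss)
```

In this sketch a higher level means more aggressive compression, and holding the sum of levels fixed only approximates a global budget when all layers have the same size. In a real setting the fitness function would presumably be an end-to-end error measured on calibration data rather than a closed-form toy.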
