

EvoPress: Towards Optimal Dynamic Model Compression via Evolutionary Search

October 18, 2024
Authors: Oliver Sieberling, Denis Kuznedelev, Eldar Kurtic, Dan Alistarh
cs.AI

Abstract

The high computational costs of large language models (LLMs) have led to a flurry of research on LLM compression, via methods such as quantization, sparsification, or structured pruning. A new frontier in this area is given by dynamic, non-uniform compression methods, which adjust the compression levels (e.g., sparsity) per-block or even per-layer in order to minimize accuracy loss, while guaranteeing a global compression threshold. Yet, current methods rely on heuristics for identifying the "importance" of a given layer towards the loss, based on assumptions such as error monotonicity, i.e. that the end-to-end model compression error is proportional to the sum of layer-wise errors. In this paper, we revisit this area, and propose a new and general approach for dynamic compression that is provably optimal in a given input range. We begin from the motivating observation that, in general, error monotonicity does not hold for LLMs: compressed models with lower sum of per-layer errors can perform worse than models with higher error sums. To address this, we propose a new general evolutionary framework for dynamic LLM compression called EvoPress, which has provable convergence, and low sample and evaluation complexity. We show that these theoretical guarantees lead to highly competitive practical performance for dynamic compression of Llama, Mistral and Phi models. Via EvoPress, we set new state-of-the-art results across all compression approaches: structural pruning (block/layer dropping), unstructured sparsity, as well as quantization with dynamic bitwidths. Our code is available at https://github.com/IST-DASLab/EvoPress.
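The abstract describes an evolutionary search over per-layer compression levels under a fixed global budget, driven by end-to-end fitness rather than summed layer-wise errors. The toy sketch below illustrates that idea with a simple (1+1)-style loop: the per-layer sensitivities, candidate sparsity levels, and fitness function are all made-up stand-ins, not EvoPress's actual measurements or operators (which evaluate real models on calibration data; see the linked repository).

```python
import random

# Hypothetical surrogate: how much each layer's output degrades per unit of
# squared sparsity. In EvoPress this would be measured end-to-end on
# calibration data; these numbers are invented for illustration.
LEVELS = [0.3, 0.5, 0.7]  # candidate per-layer sparsity levels
SENSITIVITY = [5.0, 0.5, 3.0, 0.2, 1.0, 0.1, 2.0, 0.4]  # one per layer

def fitness(assignment):
    # Toy end-to-end proxy: sensitive layers are hurt more by high sparsity.
    return sum(s * lvl ** 2 for s, lvl in zip(SENSITIVITY, assignment))

def mutate(assignment, rng):
    # Level-switch mutation: raise sparsity in one layer and lower it in
    # another by one step, so the average sparsity (the global compression
    # budget) is preserved exactly.
    child = list(assignment)
    i, j = rng.sample(range(len(child)), 2)
    hi = LEVELS.index(child[i])
    lo = LEVELS.index(child[j])
    if hi + 1 < len(LEVELS) and lo - 1 >= 0:
        child[i] = LEVELS[hi + 1]
        child[j] = LEVELS[lo - 1]
    return child

def evolve(num_layers=8, generations=200, seed=0):
    rng = random.Random(seed)
    best = [0.5] * num_layers  # uniform start at the target budget
    best_fit = fitness(best)
    for _ in range(generations):
        child = mutate(best, rng)
        f = fitness(child)
        if f < best_fit:  # greedy selection: keep the better candidate
            best, best_fit = child, f
    return best, best_fit
```

Running `evolve()` returns a non-uniform assignment with the same average sparsity as the uniform start but lower surrogate loss, shifting sparsity from sensitive layers to insensitive ones; the real method adds provable convergence and keeps the number of fitness evaluations (model evaluations) low.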


November 16, 2024