GPTailor: Large Language Model Pruning Through Layer Cutting and Stitching
June 25, 2025
Authors: Guinan Su, Li Shen, Lu Yin, Shiwei Liu, Yanwu Yang, Jonas Geiping
cs.AI
Abstract
Large language models (LLMs) have shown remarkable capabilities in language
understanding and generation. However, such impressive capability typically
comes with a substantial model size, which presents significant challenges in
deployment and inference. While structured pruning of model parameters offers a
promising way to reduce computational costs at deployment time, current methods
primarily focus on single model pruning. In this work, we develop a novel
strategy to compress models by strategically combining or merging layers from
finetuned model variants, which preserves the original model's abilities by
aggregating capabilities accentuated in different finetunes. We pose the
optimal tailoring of these LLMs as a zero-order optimization problem, adopting
a search space that supports three different operations: (1) Layer removal, (2)
Layer selection from different candidate models, and (3) Layer merging. Our
experiments demonstrate that this approach leads to competitive model pruning,
for example, for the Llama2-13B model family, our compressed models maintain
approximately 97.3% of the original performance while removing about 25% of
parameters, significantly outperforming previous state-of-the-art methods. The
code is available at https://github.com/Guinan-Su/auto-merge-llm.
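
To make the search space concrete, the sketch below illustrates the three operations named in the abstract: layer removal, layer selection from a candidate model, and layer merging. It is an illustrative assumption, not the paper's implementation (see the linked repository for that); the function build_tailored_stack, the plan format, and uniform weight averaging as the merge rule are all hypothetical simplifications, with plain NumPy arrays standing in for transformer-block parameters.

# Minimal sketch of the GPTailor-style search space, under the assumptions above.
import numpy as np

def build_tailored_stack(candidates, plan):
    """candidates: list of models, each a list of per-layer weight arrays.
    plan: one entry per layer position, either
      ("remove",), ("select", model_idx), or ("merge", [model_idx, ...]).
    Returns the tailored list of layer weights."""
    stack = []
    for pos, op in enumerate(plan):
        if op[0] == "remove":
            continue                                  # (1) layer removal
        elif op[0] == "select":
            stack.append(candidates[op[1]][pos])      # (2) take this layer from one candidate
        elif op[0] == "merge":
            layers = [candidates[i][pos] for i in op[1]]
            stack.append(np.mean(layers, axis=0))     # (3) merge by uniform averaging (assumed rule)
    return stack

# Toy usage: 3 candidate "models", 4 layers each, hidden size 2.
rng = np.random.default_rng(0)
cands = [[rng.normal(size=(2, 2)) for _ in range(4)] for _ in range(3)]
plan = [("select", 0), ("remove",), ("merge", [1, 2]), ("select", 2)]
print(len(build_tailored_stack(cands, plan)))  # 3 layers remain after one removal

In the paper's framing, the plan itself is what the zero-order optimizer searches over, scoring each candidate plan by the resulting model's downstream performance.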