

GPTailor: Large Language Model Pruning Through Layer Cutting and Stitching

June 25, 2025
Authors: Guinan Su, Li Shen, Lu Yin, Shiwei Liu, Yanwu Yang, Jonas Geiping
cs.AI

Abstract

Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. However, such impressive capability typically comes with a substantial model size, which presents significant challenges in deployment and inference. While structured pruning of model parameters offers a promising way to reduce computational costs at deployment time, current methods primarily focus on single-model pruning. In this work, we develop a novel strategy to compress models by strategically combining or merging layers from finetuned model variants, which preserves the original model's abilities by aggregating capabilities accentuated in different finetunes. We pose the optimal tailoring of these LLMs as a zero-order optimization problem, adopting a search space that supports three different operations: (1) layer removal, (2) layer selection from different candidate models, and (3) layer merging. Our experiments demonstrate that this approach leads to competitive model pruning; for example, for the Llama2-13B model family, our compressed models maintain approximately 97.3% of the original performance while removing approximately 25% of parameters, significantly outperforming previous state-of-the-art methods. The code is available at https://github.com/Guinan-Su/auto-merge-llm.
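To make the search space concrete, the following is a minimal sketch (not the authors' implementation) of the three per-layer operations named in the abstract: removing a layer, selecting it from one candidate finetune, or merging it across candidates. The candidate names, the layer count, and the uniform-averaging merge rule are illustrative assumptions; a real run would score each sampled plan with downstream task accuracy inside a zero-order optimizer rather than the toy objective used here.

```python
# Sketch of the layer cutting-and-stitching search space (illustrative only).
import random

NUM_LAYERS = 40                                     # assumed depth, e.g. Llama2-13B
CANDIDATES = ["base", "finetune_a", "finetune_b"]   # hypothetical model variants

def sample_plan(num_layers, candidates, remove_prob=0.25):
    """Sample one point in the search space: an action for every layer position."""
    plan = []
    for _ in range(num_layers):
        r = random.random()
        if r < remove_prob:
            plan.append(("remove", None))                       # (1) layer removal
        elif r < 0.75:
            plan.append(("select", random.choice(candidates)))  # (2) layer selection
        else:
            plan.append(("merge", candidates))                  # (3) layer merging
    return plan

def build_model(plan, layer_weights):
    """Assemble the tailored layer stack from a plan.

    layer_weights[name][i] holds layer i of candidate `name`
    (plain floats here stand in for real weight tensors).
    """
    stack = []
    for i, (op, arg) in enumerate(plan):
        if op == "remove":
            continue
        if op == "select":
            stack.append(layer_weights[arg][i])
        else:  # merge: uniform average over the chosen candidates
            stack.append(sum(layer_weights[m][i] for m in arg) / len(arg))
    return stack

# Toy usage: count surviving layers; a real search would evaluate task performance.
weights = {name: [float(i) for i in range(NUM_LAYERS)] for name in CANDIDATES}
plan = sample_plan(NUM_LAYERS, CANDIDATES)
model = build_model(plan, weights)
print(f"kept {len(model)} of {NUM_LAYERS} layers")
```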