
DarwinLM: Evolutionary Structured Pruning of Large Language Models

February 11, 2025
Authors: Shengkun Tang, Oliver Sieberling, Eldar Kurtic, Zhiqiang Shen, Dan Alistarh
cs.AI

Abstract

Large Language Models (LLMs) have achieved significant success across various NLP tasks. However, their massive computational costs limit their widespread use, particularly in real-time applications. Structured pruning offers an effective solution by compressing models and directly providing end-to-end speed improvements, regardless of the hardware environment. Meanwhile, different components of the model exhibit varying sensitivities towards pruning, calling for non-uniform model compression. However, a pruning method should not only identify a capable substructure, but also account for post-compression training. To this end, we propose DarwinLM, a method for training-aware structured pruning. DarwinLM builds upon an evolutionary search process, generating multiple offspring models in each generation through mutation, and selecting the fittest for survival. To assess the effect of post-training, we incorporate a lightweight, multistep training process within the offspring population, progressively increasing the number of tokens and eliminating poorly performing models in each selection stage. We validate our method through extensive experiments on Llama-2-7B, Llama-3.1-8B and Qwen-2.5-14B-Instruct, achieving state-of-the-art performance for structured pruning. For instance, DarwinLM surpasses ShearedLlama while requiring 5× less training data during post-compression training.
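As a rough, self-contained sketch of the training-aware evolutionary search described in the abstract, the loop below mutates a per-layer sparsity profile, then applies multistep selection in which candidates are scored on progressively larger token budgets and the worst performers are discarded at each stage. The candidate encoding, generation counts, token schedule, and fitness stub are illustrative assumptions chosen so the example runs on its own; they are not taken from the paper.

```python
import random

# Minimal sketch of a training-aware evolutionary pruning loop in the spirit
# of DarwinLM. The candidate encoding (per-layer sparsity profile), budgets,
# and the fitness stub are illustrative assumptions, not the paper's code.

NUM_LAYERS = 32          # e.g. a Llama-2-7B-scale transformer
TARGET_SPARSITY = 0.5    # average fraction of structured units to remove


def mutate(profile, step=0.05):
    """Shift sparsity mass between two random layers, preserving the overall
    compression budget (non-uniform pruning emerges from such mutations)."""
    child = list(profile)
    i, j = random.sample(range(len(child)), 2)
    delta = min(step, 0.9 - child[i], child[j])  # keep each layer in [0, 0.9]
    child[i] += delta
    child[j] -= delta
    return child


def fitness(profile, num_tokens):
    """Placeholder for the real measurement: structurally prune the dense
    model according to `profile`, fine-tune it briefly on `num_tokens`
    tokens, and return validation loss (lower is better). A random score
    stands in here so the sketch is runnable on its own."""
    return random.random()


def evolutionary_prune(generations=20, offspring=16,
                       token_schedule=(10_000, 100_000, 1_000_000),
                       survivors=(8, 4, 1)):
    """Each generation mutates the current best profile into offspring, then
    runs multistep selection: candidates are evaluated after training on
    progressively more tokens, eliminating poor performers at every stage."""
    best = [TARGET_SPARSITY] * NUM_LAYERS  # start from uniform pruning
    for _ in range(generations):
        population = [mutate(best) for _ in range(offspring)]
        for tokens, keep in zip(token_schedule, survivors):
            population.sort(key=lambda p: fitness(p, tokens))
            population = population[:keep]
        best = population[0]
    return best


if __name__ == "__main__":
    profile = evolutionary_prune()
    print([round(s, 2) for s in profile])
```

In the actual method, the `fitness` placeholder would correspond to pruning the dense checkpoint according to the candidate profile, running the lightweight multistep fine-tuning on the scheduled number of tokens, and scoring the resulting model; the staged `survivors` cut-offs mirror the progressive elimination of poorly performing offspring described above.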
