Compact Language Models via Pruning and Knowledge Distillation

July 19, 2024
Authors: Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov
cs.AI

Abstract

Large language models (LLMs) targeting different deployment scales and sizes are currently produced by training each variant from scratch; this is extremely compute-intensive. In this paper, we investigate if pruning an existing LLM and then re-training it with a fraction (<3%) of the original training data can be a suitable alternative to repeated, full retraining. To this end, we develop a set of practical and effective compression best practices for LLMs that combine depth, width, attention and MLP pruning with knowledge distillation-based retraining; we arrive at these best practices through a detailed empirical exploration of pruning strategies for each axis, methods to combine axes, distillation strategies, and search techniques for arriving at optimal compressed architectures. We use this guide to compress the Nemotron-4 family of LLMs by a factor of 2-4x, and compare their performance to similarly-sized models on a variety of language modeling tasks. Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch; this results in compute cost savings of 1.8x for training the full model family (15B, 8B, and 4B). Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch, perform comparably to other community models such as Mistral 7B, Gemma 7B and Llama-3 8B, and outperform state-of-the-art compression techniques from the literature. We have open-sourced Minitron model weights on Huggingface, with corresponding supplementary material including example code available on GitHub.
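
The following is a minimal, illustrative PyTorch sketch of the two stages the abstract describes: width pruning of an MLP block followed by knowledge-distillation-based retraining of the pruned student against the original teacher. The activation-magnitude importance criterion, the single-block setup, the KL-divergence objective, and all names (MLPBlock, keep, head, and so on) are assumptions for illustration only, not the paper's actual implementation.

```python
# Illustrative sketch only; not the Minitron implementation.
# Stage 1: prune MLP width using an assumed activation-magnitude importance score.
# Stage 2: retrain the pruned student to match the teacher's output distribution.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_ff, vocab = 64, 256, 1000

class MLPBlock(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))

teacher = MLPBlock(d_model, d_ff)

# --- Stage 1: importance-based width pruning (keep half of the hidden neurons) ---
calib = torch.randn(32, d_model)                 # small calibration batch
with torch.no_grad():
    acts = F.gelu(teacher.up(calib))             # (32, d_ff) hidden activations
    importance = acts.abs().mean(dim=0)          # per-neuron importance score
keep = importance.topk(d_ff // 2).indices.sort().values

student = MLPBlock(d_model, d_ff // 2)
with torch.no_grad():                            # copy surviving rows/columns
    student.up.weight.copy_(teacher.up.weight[keep])
    student.up.bias.copy_(teacher.up.bias[keep])
    student.down.weight.copy_(teacher.down.weight[:, keep])
    student.down.bias.copy_(teacher.down.bias)

# --- Stage 2: knowledge-distillation retraining on a small data fraction ---
head = nn.Linear(d_model, vocab)                 # shared, frozen output head
head.requires_grad_(False)

def logits(block, x):
    return head(block(x))

opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
for _ in range(10):
    x = torch.randn(8, d_model)                  # stand-in for token hidden states
    with torch.no_grad():
        t = F.softmax(logits(teacher, x), dim=-1)        # frozen teacher distribution
    s = F.log_softmax(logits(student, x), dim=-1)
    loss = F.kl_div(s, t, reduction="batchmean")         # match the teacher's outputs
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final KD loss: {loss.item():.4f}")
```

In a full model, the same idea would be applied per pruning axis (depth, width, attention heads, MLP neurons), with the distillation loss computed over the vocabulary logits of the complete network rather than a single toy block.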
