

Compact Language Models via Pruning and Knowledge Distillation

July 19, 2024
作者: Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov
cs.AI

Abstract

Large language models (LLMs) targeting different deployment scales and sizes are currently produced by training each variant from scratch; this is extremely compute-intensive. In this paper, we investigate if pruning an existing LLM and then re-training it with a fraction (<3%) of the original training data can be a suitable alternative to repeated, full retraining. To this end, we develop a set of practical and effective compression best practices for LLMs that combine depth, width, attention and MLP pruning with knowledge distillation-based retraining; we arrive at these best practices through a detailed empirical exploration of pruning strategies for each axis, methods to combine axes, distillation strategies, and search techniques for arriving at optimal compressed architectures. We use this guide to compress the Nemotron-4 family of LLMs by a factor of 2-4x, and compare their performance to similarly-sized models on a variety of language modeling tasks. Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch; this results in compute cost savings of 1.8x for training the full model family (15B, 8B, and 4B). Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch, perform comparably to other community models such as Mistral 7B, Gemma 7B and Llama-3 8B, and outperform state-of-the-art compression techniques from the literature. We have open-sourced Minitron model weights on Huggingface, with corresponding supplementary material including example code available on GitHub.
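To make the recipe concrete, below is a minimal, hypothetical PyTorch sketch of the two ingredients the abstract describes: activation-based width pruning of a feed-forward (MLP) block, and logit-level knowledge distillation for retraining the pruned student against the original model. The function names (`mlp_neuron_importance`, `prune_mlp`, `distillation_loss`) and the assumption of a transformer MLP with separate up- and down-projections are illustrative only; this is not the authors' released Minitron code.

```python
# Hypothetical sketch (not the paper's implementation) of:
#  (1) activation-based width pruning of an MLP block, and
#  (2) logit-based knowledge distillation for retraining the pruned student.

import torch
import torch.nn as nn
import torch.nn.functional as F


def mlp_neuron_importance(hidden_acts: torch.Tensor) -> torch.Tensor:
    """Score each intermediate MLP neuron by the mean magnitude of its
    activation over a small calibration batch.
    hidden_acts: [batch, seq, d_ff] -> returns [d_ff]."""
    return hidden_acts.abs().mean(dim=(0, 1))


def prune_mlp(up_proj: nn.Linear, down_proj: nn.Linear,
              importance: torch.Tensor, keep: int):
    """Keep the `keep` highest-scoring intermediate neurons, shrinking both
    the up- and down-projection matrices of a feed-forward block."""
    idx = importance.topk(keep).indices.sort().values
    new_up = nn.Linear(up_proj.in_features, keep,
                       bias=up_proj.bias is not None)
    new_down = nn.Linear(keep, down_proj.out_features,
                         bias=down_proj.bias is not None)
    with torch.no_grad():
        new_up.weight.copy_(up_proj.weight[idx])          # rows: kept neurons
        if up_proj.bias is not None:
            new_up.bias.copy_(up_proj.bias[idx])
        new_down.weight.copy_(down_proj.weight[:, idx])   # cols: kept neurons
        if down_proj.bias is not None:
            new_down.bias.copy_(down_proj.bias)
    return new_up, new_down


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between teacher and student token distributions,
    a standard logit-distillation objective for retraining."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(s.flatten(0, -2), t.flatten(0, -2), reduction="batchmean")
    return kl * temperature ** 2
```

The same importance-ranking idea extends to attention heads, embedding channels, and whole layers (depth pruning), and a distillation loss of this kind would drive the short retraining phase (under 3% of the original training tokens) described in the abstract.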
