LLM Pruning and Distillation in Practice: The Minitron Approach
August 21, 2024
Authors: Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov
cs.AI
Abstract
We present a comprehensive report on compressing the Llama 3.1 8B and Mistral
NeMo 12B models to 4B and 8B parameters, respectively, using pruning and
distillation. We explore two distinct pruning strategies: (1) depth pruning and
(2) joint hidden/attention/MLP (width) pruning, and evaluate the results on
common benchmarks from the LM Evaluation Harness. The models are then aligned
with NeMo Aligner and tested in instruct-tuned versions. This approach produces
a compelling 4B model from Llama 3.1 8B and a state-of-the-art
Mistral-NeMo-Minitron-8B (MN-Minitron-8B for brevity) model from Mistral NeMo
12B. We found that with no access to the original data, it is beneficial to
slightly fine-tune teacher models on the distillation dataset. We open-source
our base model weights on Hugging Face with a permissive license.
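As an illustration only, and not the authors' NeMo implementation, the sketch below shows the two ideas named in the abstract: width pruning of an MLP block via activation-based importance scores, and logit distillation from a teacher to the pruned student with a KL-divergence loss. All class, function, and variable names (MLP, mlp_neuron_importance, width_prune_mlp, distillation_loss) are hypothetical, and random tensors stand in for real hidden states and logits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy feed-forward block standing in for one transformer MLP layer (illustrative only).
class MLP(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))

def mlp_neuron_importance(mlp, calib_batch):
    # Activation-based importance: mean absolute activation of each
    # intermediate neuron over a small calibration batch.
    with torch.no_grad():
        acts = F.gelu(mlp.up(calib_batch))   # (batch, seq, d_ff)
        return acts.abs().mean(dim=(0, 1))   # (d_ff,)

def width_prune_mlp(mlp, importance, keep_ff):
    # Keep only the `keep_ff` highest-scoring intermediate neurons.
    keep = torch.topk(importance, keep_ff).indices
    pruned = MLP(mlp.up.in_features, keep_ff)
    pruned.up.weight.data = mlp.up.weight.data[keep]
    pruned.up.bias.data = mlp.up.bias.data[keep]
    pruned.down.weight.data = mlp.down.weight.data[:, keep]
    pruned.down.bias.data = mlp.down.bias.data.clone()
    return pruned

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    # KL divergence between teacher and student token distributions.
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

if __name__ == "__main__":
    torch.manual_seed(0)
    mlp = MLP(d_model=64, d_ff=256)
    calib = torch.randn(4, 16, 64)            # fake calibration hidden states
    imp = mlp_neuron_importance(mlp, calib)
    small_mlp = width_prune_mlp(mlp, imp, keep_ff=128)
    print(small_mlp)

    student_logits = torch.randn(2, 8, 32000)  # fake student outputs
    teacher_logits = torch.randn(2, 8, 32000)  # fake teacher outputs
    print(distillation_loss(student_logits, teacher_logits, temperature=2.0))
```

The same importance-ranking idea extends to attention heads and the hidden dimension for the joint width pruning described in the abstract; depth pruning instead removes whole layers, and the distillation loss is then used to recover accuracy in either case.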