

LLM Pruning and Distillation in Practice: The Minitron Approach

August 21, 2024
作者: Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov
cs.AI

Abstract

We present a comprehensive report on compressing the Llama 3.1 8B and Mistral NeMo 12B models to 4B and 8B parameters, respectively, using pruning and distillation. We explore two distinct pruning strategies: (1) depth pruning and (2) joint hidden/attention/MLP (width) pruning, and evaluate the results on common benchmarks from the LM Evaluation Harness. The models are then aligned with NeMo Aligner and tested in instruct-tuned versions. This approach produces a compelling 4B model from Llama 3.1 8B and a state-of-the-art Mistral-NeMo-Minitron-8B (MN-Minitron-8B for brevity) model from Mistral NeMo 12B. We found that with no access to the original data, it is beneficial to slightly fine-tune teacher models on the distillation dataset. We open-source our base model weights on Hugging Face with a permissive license.
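The recipe described above combines structured pruning of the student with logit-level knowledge distillation from the teacher. The snippet below is a minimal, hypothetical sketch of such a distillation step in PyTorch: the temperature value, the training-loop shape, and the assumption that both models expose Hugging Face-style `.logits` outputs are illustrative choices, not details taken from the paper.

```python
# Minimal sketch of logit-based knowledge distillation: a pruned student is
# trained to match the frozen teacher's output distribution via KL divergence.
# Temperature and model interfaces are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # 'batchmean' averages the KL over the batch; the T^2 factor keeps gradient
    # magnitudes comparable across temperatures.
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kl * (temperature ** 2)

def train_step(student, teacher, input_ids, optimizer):
    """One distillation step: teacher is frozen, the pruned student is updated."""
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits
    student_logits = student(input_ids).logits
    loss = distillation_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the student would be the depth- or width-pruned model and the teacher the original (optionally lightly fine-tuned on the distillation corpus, as the abstract suggests when the original training data is unavailable).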

