

LLM Pruning and Distillation in Practice: The Minitron Approach

August 21, 2024
作者: Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov
cs.AI

Abstract

We present a comprehensive report on compressing the Llama 3.1 8B and Mistral NeMo 12B models to 4B and 8B parameters, respectively, using pruning and distillation. We explore two distinct pruning strategies: (1) depth pruning and (2) joint hidden/attention/MLP (width) pruning, and evaluate the results on common benchmarks from the LM Evaluation Harness. The models are then aligned with NeMo Aligner and tested in instruct-tuned versions. This approach produces a compelling 4B model from Llama 3.1 8B and a state-of-the-art Mistral-NeMo-Minitron-8B (MN-Minitron-8B for brevity) model from Mistral NeMo 12B. We found that with no access to the original data, it is beneficial to slightly fine-tune teacher models on the distillation dataset. We open-source our base model weights on Hugging Face with a permissive license.
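The recipe described above combines structured pruning of the student with logit-level knowledge distillation from the teacher. The snippet below is a minimal, hypothetical sketch of such a distillation step in PyTorch: the temperature value, the training-loop shape, and the assumption that both models expose Hugging Face-style `.logits` outputs are illustrative choices, not details taken from the paper.

```python
# Minimal sketch of logit-based knowledge distillation: a pruned student is
# trained to match the frozen teacher's output distribution via KL divergence.
# Temperature and model interfaces are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # 'batchmean' averages the KL over the batch; the T^2 factor keeps gradient
    # magnitudes comparable across temperatures.
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kl * (temperature ** 2)

def train_step(student, teacher, input_ids, optimizer):
    """One distillation step: teacher is frozen, the pruned student is updated."""
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits
    student_logits = student(input_ids).logits
    loss = distillation_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the student would be the depth- or width-pruned model and the teacher the original (optionally lightly fine-tuned on the distillation corpus, as the abstract suggests when the original training data is unavailable).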

