LLM Pruning and Distillation in Practice: The Minitron Approach

Abstract

We present a comprehensive report on compressing the Llama 3.1 8B and Mistral NeMo 12B models to 4B and 8B parameters, respectively, using pruning and distillation. We explore two distinct pruning strategies: (1) depth pruning and (2) joint hidden/attention/MLP (width) pruning, and evaluate the results on common benchmarks from the LM Evaluation Harness. The models are then aligned with NeMo Aligner and tested in instruct-tuned versions. This approach produces a compelling 4B model from Llama 3.1 8B and a state-of-the-art Mistral-NeMo-Minitron-8B (MN-Minitron-8B for brevity) model from Mistral NeMo 12B. We found that, without access to the original training data, it is beneficial to slightly fine-tune the teacher models on the distillation dataset. We open-source our base model weights on Hugging Face under a permissive license.
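
To make the distinction between the two pruning strategies concrete, here is a minimal sketch on a toy model shape. Everything in it is an assumption made for illustration: `ToyConfig`, `approx_params`, `depth_prune`, `width_prune`, and all the numbers are hypothetical and do not correspond to the actual Llama 3.1, Mistral NeMo, or Minitron configurations. The sketch only changes tensor shapes. It does not show how layers or channels are selected, nor the distillation step that follows pruning.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyConfig:
    """Hypothetical transformer shape (illustrative only, not a real model)."""
    num_layers: int
    hidden_size: int
    num_heads: int
    head_dim: int
    mlp_size: int


def approx_params(cfg: ToyConfig) -> int:
    """Rough weight count of the transformer blocks.

    Counts only the attention projections (Q, K, V, output) and a two-matrix
    MLP per layer; embeddings, biases, and norms are ignored.
    """
    attn = 4 * cfg.hidden_size * cfg.num_heads * cfg.head_dim
    mlp = 2 * cfg.hidden_size * cfg.mlp_size
    return cfg.num_layers * (attn + mlp)


def depth_prune(cfg: ToyConfig, keep_layers: int) -> ToyConfig:
    """Depth pruning: drop whole layers, keep every layer's width unchanged."""
    return replace(cfg, num_layers=keep_layers)


def width_prune(cfg: ToyConfig, hidden_size: int, num_heads: int,
                mlp_size: int) -> ToyConfig:
    """Width pruning: keep all layers, shrink hidden/attention/MLP dimensions."""
    return replace(cfg, hidden_size=hidden_size, num_heads=num_heads,
                   mlp_size=mlp_size)


# Toy numbers chosen only to keep the arithmetic easy to follow.
base = ToyConfig(num_layers=8, hidden_size=64, num_heads=4, head_dim=16,
                 mlp_size=256)
shallow = depth_prune(base, keep_layers=4)
narrow = width_prune(base, hidden_size=48, num_heads=4, mlp_size=192)

for name, cfg in [("base", base), ("depth-pruned", shallow),
                  ("width-pruned", narrow)]:
    print(f"{name}: {cfg.num_layers} layers, ~{approx_params(cfg)} block weights")
```

In this toy setting both variants reduce the weight count, but the depth-pruned model has fewer layers of the original width, while the width-pruned model keeps all layers and makes each one narrower.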

