Compact Language Models via Pruning and Knowledge Distillation

Abstract

Large language models (LLMs) targeting different deployment scales and sizes are currently produced by training each variant from scratch, which is extremely compute-intensive. In this paper, we investigate whether pruning an existing LLM and then retraining it with a fraction (<3%) of the original training data can be a suitable alternative to repeated, full retraining. To this end, we develop a set of practical and effective compression best practices for LLMs that combine depth, width, attention, and MLP pruning with knowledge distillation-based retraining. We arrive at these best practices through a detailed empirical exploration of pruning strategies for each axis, methods to combine axes, distillation strategies, and search techniques for arriving at optimal compressed architectures. We use this guide to compress the Nemotron-4 family of LLMs by a factor of 2-4x, and compare their performance to similarly sized models on a variety of language modeling tasks. Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch; this results in compute cost savings of 1.8x for training the full model family (15B, 8B, and 4B). Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch, perform comparably to other community models such as Mistral 7B, Gemma 7B, and Llama-3 8B, and outperform state-of-the-art compression techniques from the literature. We have open-sourced the Minitron model weights on Hugging Face, with corresponding supplementary material, including example code, available on GitHub.
