FP8-LM: Training FP8 Large Language Models

Abstract

In this paper, we explore FP8 low-bit data formats for efficient training of large language models (LLMs). Our key insight is that most variables in LLM training, such as gradients and optimizer states, can use low-precision data formats without compromising model accuracy and without requiring changes to hyper-parameters. Specifically, we propose a new FP8 automatic mixed-precision framework for training LLMs. The framework offers three levels of FP8 utilization to streamline mixed-precision and distributed parallel training, incrementally incorporating 8-bit gradients, optimizer states, and distributed learning. Experimental results show that, when training a GPT-175B model on the H100 GPU platform, our FP8 mixed-precision training framework not only reduced real memory usage by 42% but also ran 64% faster than the widely adopted BF16 framework (i.e., Megatron-LM), surpassing the speed of Nvidia Transformer Engine by 17%. This substantially reduces the training costs for large foundation models. Furthermore, our FP8 mixed-precision training methodology is generic: it can be applied seamlessly to other tasks such as LLM instruction tuning and reinforcement learning with human feedback, offering savings in fine-tuning expenses. Our FP8 low-precision training framework is open-sourced at aka.ms/MS.AMP (https://github.com/Azure/MS-AMP).
