

FP8-LM: Training FP8 Large Language Models

October 27, 2023
Authors: Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu, Ruihang Li, Miaosen Zhang, Chen Li, Jia Ning, Ruizhe Wang, Zheng Zhang, Shuguang Liu, Joe Chau, Han Hu, Peng Cheng
cs.AI

Abstract

In this paper, we explore FP8 low-bit data formats for efficient training of large language models (LLMs). Our key insight is that most variables in LLM training, such as gradients and optimizer states, can employ low-precision data formats without compromising model accuracy or requiring changes to hyper-parameters. Specifically, we propose a new FP8 automatic mixed-precision framework for training LLMs. This framework offers three levels of FP8 utilization to streamline mixed-precision and distributed parallel training for LLMs, incrementally incorporating 8-bit gradients, optimizer states, and distributed learning. Experimental results show that, during training of a GPT-175B model on the H100 GPU platform, our FP8 mixed-precision training framework not only achieved a remarkable 42% reduction in real memory usage but also ran 64% faster than the widely adopted BF16 framework (i.e., Megatron-LM), surpassing the speed of the Nvidia Transformer Engine by 17%. This largely reduces the training costs of large foundation models. Furthermore, our FP8 mixed-precision training methodology is generic: it can be seamlessly applied to other tasks such as LLM instruction tuning and reinforcement learning with human feedback, offering savings in fine-tuning expenses. Our FP8 low-precision training framework is open-sourced at https://github.com/Azure/MS-AMP (aka.ms/MS.AMP).
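
To make the core idea more concrete, below is a minimal sketch of per-tensor FP8 quantization with a dynamic scaling factor, the kind of conversion applied to gradients and optimizer states in FP8 mixed-precision training. This is an illustration only, not the paper's or MS-AMP's implementation: it assumes PyTorch ≥ 2.1 with native float8 dtypes, and the helper names (`compute_fp8_scale`, `quantize_to_fp8`, `dequantize_from_fp8`) are hypothetical.

```python
# Illustrative sketch of per-tensor FP8 (E4M3) scaling; not the MS-AMP implementation.
# Assumes PyTorch >= 2.1, which provides the torch.float8_e4m3fn dtype.
import torch

# Largest finite magnitude representable in the E4M3 FP8 format.
FP8_E4M3_MAX = 448.0


def compute_fp8_scale(tensor: torch.Tensor) -> torch.Tensor:
    """Per-tensor scale so the largest value maps near the FP8 range limit."""
    amax = tensor.abs().max().clamp(min=1e-12)
    return FP8_E4M3_MAX / amax


def quantize_to_fp8(tensor: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Scale a higher-precision tensor and cast it to FP8; return the FP8 tensor and its scale."""
    scale = compute_fp8_scale(tensor)
    fp8_tensor = (tensor * scale).to(torch.float8_e4m3fn)
    return fp8_tensor, scale


def dequantize_from_fp8(fp8_tensor: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate higher-precision tensor from FP8 storage."""
    return fp8_tensor.to(torch.float32) / scale


# Example: round-trip a gradient-like tensor through FP8 storage.
grad = torch.randn(1024, 1024) * 0.01
grad_fp8, scale = quantize_to_fp8(grad)
grad_restored = dequantize_from_fp8(grad_fp8, scale)
```

Storing gradients and optimizer states this way (FP8 payload plus a small per-tensor scale) is what makes the memory and bandwidth savings described in the abstract possible, at the cost of tracking and updating the scaling factors during training.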