FP8-LM: Training FP8 Large Language Models

October 27, 2023
Authors: Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu, Ruihang Li, Miaosen Zhang, Chen Li, Jia Ning, Ruizhe Wang, Zheng Zhang, Shuguang Liu, Joe Chau, Han Hu, Peng Cheng
cs.AI

Abstract

In this paper, we explore FP8 low-bit data formats for efficient training of large language models (LLMs). Our key insight is that most variables in LLM training, such as gradients and optimizer states, can employ low-precision data formats without compromising model accuracy and without requiring any changes to hyper-parameters. Specifically, we propose a new FP8 automatic mixed-precision framework for training LLMs. This framework offers three levels of FP8 utilization to streamline mixed-precision and distributed parallel training for LLMs, incrementally incorporating 8-bit gradients, optimizer states, and distributed learning. Experimental results show that, during the training of the GPT-175B model on the H100 GPU platform, our FP8 mixed-precision training framework not only achieved a remarkable 42% reduction in real memory usage but also ran 64% faster than the widely adopted BF16 framework (i.e., Megatron-LM), surpassing the speed of the Nvidia Transformer Engine by 17%. This largely reduces the training costs of large foundation models. Furthermore, our FP8 mixed-precision training methodology is generic: it can be seamlessly applied to other tasks such as LLM instruction tuning and reinforcement learning with human feedback, offering savings in fine-tuning expenses. Our FP8 low-precision training framework is open-sourced at https://github.com/Azure/MS-AMP (aka.ms/MS.AMP).
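
Below is a minimal sketch of how the framework's three incremental FP8 levels might be enabled through the open-sourced MS-AMP library. The `msamp.initialize` call and the `opt_level` values ("O1"/"O2"/"O3") follow the usage pattern documented in the repository, but the exact API and level semantics shown here are assumptions, not a definitive interface; consult https://github.com/Azure/MS-AMP for the authoritative usage.

```python
# Hypothetical sketch of FP8 mixed-precision training with MS-AMP.
# Assumed mapping of opt_level to the paper's incremental FP8 levels:
#   "O1": FP8 gradients and gradient communication
#   "O2": O1 + FP8 optimizer states
#   "O3": O2 + FP8 support for distributed parallel training
# Verify against the MS-AMP repository before relying on these names.
import torch
import msamp  # assumed package name from https://github.com/Azure/MS-AMP

# A toy model and a standard optimizer; real usage would be an LLM.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Wrap both so that gradients (and, at "O2", optimizer states) use FP8.
model, optimizer = msamp.initialize(model, optimizer, opt_level="O2")

# Training loop is unchanged: no hyper-parameter adjustments are needed.
for step in range(10):
    x = torch.randn(32, 1024, device="cuda")
    loss = model(x).float().pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The appeal of this design is that the FP8 machinery is confined to the wrapping step: the model definition, the training loop, and the hyper-parameters stay as they were under BF16 training.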