FP8-LM: Training FP8 Large Language Models

October 27, 2023
Authors: Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu, Ruihang Li, Miaosen Zhang, Chen Li, Jia Ning, Ruizhe Wang, Zheng Zhang, Shuguang Liu, Joe Chau, Han Hu, Peng Cheng
cs.AI

Abstract

In this paper, we explore FP8 low-bit data formats for efficient training of large language models (LLMs). Our key insight is that most variables in LLM training, such as gradients and optimizer states, can employ low-precision data formats without compromising model accuracy and without requiring any changes to hyper-parameters. Specifically, we propose a new FP8 automatic mixed-precision framework for training LLMs. This framework offers three levels of FP8 utilization to streamline mixed-precision and distributed parallel training for LLMs, incrementally incorporating 8-bit gradients, optimizer states, and distributed learning. Experimental results show that, during the training of the GPT-175B model on the H100 GPU platform, our FP8 mixed-precision training framework not only achieved a remarkable 42% reduction in real memory usage but also ran 64% faster than the widely adopted BF16 framework (i.e., Megatron-LM), surpassing the speed of the Nvidia Transformer Engine by 17%. This largely reduces the training costs of large foundation models. Furthermore, our FP8 mixed-precision training methodology is generic: it can be seamlessly applied to other tasks such as LLM instruction tuning and reinforcement learning with human feedback, offering savings in fine-tuning expenses. Our FP8 low-precision training framework is open-sourced at https://github.com/Azure/MS-AMP (aka.ms/MS.AMP).
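
Below is a minimal sketch of how the framework's three incremental FP8 levels might be enabled through the open-sourced MS-AMP library. The `msamp.initialize` call and the `opt_level` values ("O1"/"O2"/"O3") follow the usage pattern documented in the repository, but the exact API and level semantics shown here are assumptions, not a definitive interface; consult https://github.com/Azure/MS-AMP for the authoritative usage.

```python
# Hypothetical sketch of FP8 mixed-precision training with MS-AMP.
# Assumed mapping of opt_level to the paper's incremental FP8 levels:
#   "O1": FP8 gradients and gradient communication
#   "O2": O1 + FP8 optimizer states
#   "O3": O2 + FP8 support for distributed parallel training
# Verify against the MS-AMP repository before relying on these names.
import torch
import msamp  # assumed package name from https://github.com/Azure/MS-AMP

# A toy model and a standard optimizer; real usage would be an LLM.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Wrap both so that gradients (and, at "O2", optimizer states) use FP8.
model, optimizer = msamp.initialize(model, optimizer, opt_level="O2")

# Training loop is unchanged: no hyper-parameter adjustments are needed.
for step in range(10):
    x = torch.randn(32, 1024, device="cuda")
    loss = model(x).float().pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The appeal of this design is that the FP8 machinery is confined to the wrapping step: the model definition, the training loop, and the hyper-parameters stay as they were under BF16 training.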