Efficient Post-training Quantization with FP8 Formats
September 26, 2023
Authors: Haihao Shen, Naveen Mellempudi, Xin He, Qun Gao, Chang Wang, Mengni Wang
cs.AI
Abstract
Recent advances in deep learning methods such as LLMs and Diffusion models
have created a need for improved quantization methods that can meet the
computational demands of these modern architectures while maintaining accuracy.
Towards this goal, we study the advantages of FP8 data formats for
post-training quantization across 75 unique network architectures covering a
wide range of tasks, including machine translation, language modeling, text
generation, image classification, generation, and segmentation. We examine
three different FP8 representations (E5M2, E4M3, and E3M4) to study the effects
of varying degrees of trade-off between dynamic range and precision on model
accuracy. Based on our extensive study, we developed a quantization workflow
that generalizes across different network architectures. Our empirical results
show that FP8 formats outperform INT8 in multiple aspects, including workload
coverage (92.64% vs. 65.87%), model accuracy, and suitability for a broader
range of operations. Furthermore, our findings suggest that E4M3 is better
suited for NLP models, whereas E3M4 performs marginally better than E4M3 on
computer vision tasks. The code is publicly available on Intel Neural
Compressor: https://github.com/intel/neural-compressor.
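
The trade-off between dynamic range and precision across E5M2, E4M3, and E3M4 can be made concrete with a small calculation. The sketch below is not taken from the paper or from Intel Neural Compressor. It assumes a plain IEEE-754-style layout (1 sign bit, exponent bias of 2^(e-1) - 1, and the top exponent code reserved for infinities and NaNs), and the helper name `fp8_properties` is hypothetical. FP8 variants used in practice may reserve fewer encodings for special values, which raises the largest representable value, so the maxima computed here are illustrative only.

```python
def fp8_properties(exp_bits: int, man_bits: int) -> dict:
    """Range and precision of an 8-bit float with the given field widths.

    Assumption: IEEE-754-style encoding, where the all-ones exponent code
    is reserved for infinities/NaNs and the all-zeros code for subnormals.
    """
    if 1 + exp_bits + man_bits != 8:
        raise ValueError("sign + exponent + mantissa bits must total 8")

    bias = 2 ** (exp_bits - 1) - 1
    max_exp_field = 2 ** exp_bits - 2  # largest non-reserved exponent code

    return {
        "bias": bias,
        # Largest finite value: mantissa all ones, largest normal exponent.
        "max_normal": (2 - 2.0 ** -man_bits) * 2.0 ** (max_exp_field - bias),
        # Smallest positive normal value: exponent code 1, mantissa zero.
        "min_normal": 2.0 ** (1 - bias),
        # Smallest positive subnormal value: one unit in the last place.
        "min_subnormal": 2.0 ** (1 - bias - man_bits),
        # Relative spacing between adjacent normal values.
        "epsilon": 2.0 ** -man_bits,
    }


FORMATS = {"E5M2": (5, 2), "E4M3": (4, 3), "E3M4": (3, 4)}

for name, (e, m) in FORMATS.items():
    p = fp8_properties(e, m)
    print(
        f"{name}: max={p['max_normal']:g}, "
        f"min_subnormal={p['min_subnormal']:g}, epsilon={p['epsilon']:g}"
    )
```

Under these assumptions, each bit moved from the exponent to the mantissa halves the relative step size (0.25, 0.125, 0.0625) while sharply reducing the largest finite value (57344, 240, 15.5). This is the range-versus-precision tension that the three formats are chosen to probe.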