

Efficient Post-training Quantization with FP8 Formats

September 26, 2023
Authors: Haihao Shen, Naveen Mellempudi, Xin He, Qun Gao, Chang Wang, Mengni Wang
cs.AI

Abstract

Recent advances in deep learning methods such as LLMs and Diffusion models have created a need for improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. Towards this goal, we study the advantages of FP8 data formats for post-training quantization across 75 unique network architectures covering a wide range of tasks, including machine translation, language modeling, text generation, image classification, generation, and segmentation. We examine three different FP8 representations (E5M2, E4M3, and E3M4) to study the effects of varying degrees of trade-off between dynamic range and precision on model accuracy. Based on our extensive study, we developed a quantization workflow that generalizes across different network architectures. Our empirical results show that FP8 formats outperform INT8 in multiple aspects, including workload coverage (92.64% vs. 65.87%), model accuracy and suitability for a broader range of operations. Furthermore, our findings suggest that E4M3 is better suited for NLP models, whereas E3M4 performs marginally better than E4M3 on computer vision tasks. The code is publicly available on Intel Neural Compressor: https://github.com/intel/neural-compressor.
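To make the range-versus-precision trade-off among E5M2, E4M3, and E3M4 concrete, below is a minimal NumPy sketch (illustrative only, not the paper's Intel Neural Compressor implementation) that rounds a tensor onto the value grid of a generic sign/exponent/mantissa FP8 format. It assumes an IEEE-754-style layout in which the top exponent is reserved for Inf/NaN; the actual OCP E4M3 encoding reclaims that exponent, extending its range to ±448.

```python
import numpy as np

def fp8_quantize(x, exp_bits, man_bits):
    """Round each value to its nearest neighbor on the grid of a
    1-sign / exp_bits-exponent / man_bits-mantissa FP8 format
    (round-to-nearest, clipped at the largest normal value)."""
    bias = 2 ** (exp_bits - 1) - 1
    # Simplification: reserve the top exponent for Inf/NaN as in IEEE 754.
    # (OCP E4M3 reclaims it, so real E4M3 reaches +/-448, not +/-240.)
    max_exp = (2 ** exp_bits - 2) - bias
    max_val = (2.0 - 2.0 ** -man_bits) * 2.0 ** max_exp
    x = np.clip(np.asarray(x, dtype=np.float64), -max_val, max_val)
    # frexp: x = m * 2**e with m in [0.5, 1), so the IEEE exponent is e - 1.
    _, e = np.frexp(x)
    exp = np.maximum(e - 1, 1 - bias)   # clamp into the subnormal range
    step = 2.0 ** (exp - man_bits)      # grid spacing at this exponent
    return np.round(x / step) * step

# Same tensor on three grids: E5M2 trades precision for range,
# E3M4 trades range for precision, E4M3 sits in between.
w = np.array([0.30, 1.70, 3.14, 100.0])
for name, (e, m) in {"E5M2": (5, 2), "E4M3": (4, 3), "E3M4": (3, 4)}.items():
    print(name, fp8_quantize(w, e, m))
```

On this input, E3M4 resolves 0.30 to 0.296875 where E5M2 can only reach 0.3125, but it clips 100.0 to its maximum of 15.5, illustrating why the best format is task-dependent.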