Efficient Post-training Quantization with FP8 Formats

Abstract

Recent advances in deep learning methods such as LLMs and diffusion models have created a need for improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. Toward this goal, we study the advantages of FP8 data formats for post-training quantization across 75 unique network architectures covering a wide range of tasks, including machine translation, language modeling, text generation, image classification, generation, and segmentation. We examine three different FP8 representations (E5M2, E4M3, and E3M4) to study how different trade-offs between dynamic range and precision affect model accuracy. Based on our extensive study, we developed a quantization workflow that generalizes across different network architectures. Our empirical results show that FP8 formats outperform INT8 in multiple aspects, including workload coverage (92.64% vs. 65.87%), model accuracy, and suitability for a broader range of operations. Furthermore, our findings suggest that E4M3 is better suited for NLP models, whereas E3M4 performs marginally better than E4M3 on computer vision tasks. The code is publicly available on Intel Neural Compressor: https://github.com/intel/neural-compressor.
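The trade-off between dynamic range and precision across the three formats can be illustrated with a short calculation. The sketch below is not taken from the paper or from Intel Neural Compressor. It assumes an IEEE-754-style layout (1 sign bit, bias of 2^(e-1) - 1, the top exponent code reserved for infinities and NaNs, subnormals supported), and the `FP8Format` class and its property names are hypothetical. Actual FP8 encodings may reclaim the reserved exponent code and therefore reach a larger maximum value than the one computed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FP8Format:
    """Hypothetical helper describing an 8-bit float with an IEEE-style layout."""
    name: str
    exp_bits: int  # number of exponent bits (e)
    man_bits: int  # number of mantissa bits (m)

    @property
    def bias(self) -> int:
        # Assumption: IEEE-style exponent bias.
        return 2 ** (self.exp_bits - 1) - 1

    @property
    def max_normal(self) -> float:
        # Assumption: the all-ones exponent code is reserved for inf/NaN,
        # so the largest usable exponent code is 2^e - 2.
        max_exp = (2 ** self.exp_bits - 2) - self.bias
        return (2.0 - 2.0 ** -self.man_bits) * 2.0 ** max_exp

    @property
    def min_subnormal(self) -> float:
        # Smallest positive value: one mantissa step at the minimum exponent.
        return 2.0 ** (1 - self.bias - self.man_bits)

    @property
    def epsilon(self) -> float:
        # Relative spacing between 1.0 and the next representable value.
        return 2.0 ** -self.man_bits


FORMATS = [
    FP8Format("E5M2", exp_bits=5, man_bits=2),
    FP8Format("E4M3", exp_bits=4, man_bits=3),
    FP8Format("E3M4", exp_bits=3, man_bits=4),
]

for fmt in FORMATS:
    print(
        f"{fmt.name}: max={fmt.max_normal:g}, "
        f"min_subnormal={fmt.min_subnormal:g}, epsilon={fmt.epsilon:g}"
    )
```

Under these assumptions, each mantissa bit traded for an exponent bit halves the precision (doubles `epsilon`) while greatly widening the representable range, which is the trade-off the three formats span.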
