Quartet: Native FP4 Training Can Be Optimal for Large Language Models

May 20, 2025
作者: Roberto L. Castro, Andrei Panferov, Soroush Tabesh, Oliver Sieberling, Jiale Chen, Mahdi Nikdan, Saleh Ashkboos, Dan Alistarh
cs.AI

Abstract

The rapid advancement of large language models (LLMs) has been paralleled by unprecedented increases in computational demands, with training costs for state-of-the-art models doubling every few months. Training models directly in low-precision arithmetic offers a solution, improving both computational throughput and energy efficiency. Specifically, NVIDIA's recent Blackwell architecture facilitates extremely low-precision operations, notably FP4 variants, promising substantial efficiency gains. Yet, current algorithms for training LLMs in FP4 precision face significant accuracy degradation and often rely on mixed-precision fallbacks. In this paper, we systematically investigate hardware-supported FP4 training and introduce Quartet, a new approach enabling accurate, end-to-end FP4 training with all the major computations (e.g., in linear layers) performed in low precision. Through extensive evaluations on Llama-type models, we reveal a new low-precision scaling law that quantifies performance trade-offs across varying bit-widths, allowing us to identify Quartet as a "near-optimal" low-precision training technique in terms of accuracy versus computation. We implement Quartet using optimized CUDA kernels tailored for NVIDIA Blackwell GPUs, and show that it achieves state-of-the-art accuracy for FP4 precision, successfully training billion-scale models. Our method demonstrates that fully FP4-based training is a competitive alternative to standard-precision and FP8 training. Our code is available at https://github.com/IST-DASLab/Quartet.
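To make the end-to-end FP4 idea concrete, below is a minimal, illustrative sketch of simulated FP4 (E2M1) quantization applied to both operands of a linear layer. This is not Quartet's actual algorithm or its Blackwell CUDA kernels: the helper names (`quantize_fp4`, `fp4_linear`), the per-row absmax scaling, and plain round-to-nearest are assumptions chosen for clarity, whereas production FP4 training pipelines typically add details such as fine-grained microscaling groups and unbiased (stochastic) rounding for gradients.

```python
import numpy as np

# The eight non-negative magnitudes representable in FP4 E2M1
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(x: np.ndarray) -> np.ndarray:
    """Simulate FP4 round-to-nearest with per-row absmax scaling:
    each row's largest magnitude is mapped to 6.0, the FP4 maximum."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / FP4_GRID[-1]
    scale = np.where(scale == 0.0, 1.0, scale)               # guard all-zero rows
    mags = np.abs(x) / scale                                 # magnitudes in [0, 6]
    nearest = np.abs(mags[..., None] - FP4_GRID).argmin(-1)  # closest grid point
    return np.sign(x) * FP4_GRID[nearest] * scale            # dequantized values

def fp4_linear(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Linear-layer forward with both operands quantized, so the matmul
    itself could in principle run on FP4 tensor cores."""
    return quantize_fp4(x) @ quantize_fp4(w).T

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))
w = rng.standard_normal((8, 16))
print(np.abs(fp4_linear(x, w) - x @ w.T).mean())  # mean quantization error
```

The per-row scale is kept in higher precision; only the 4-bit values would feed the hardware matmul, which is where the throughput gain over FP8 and BF16 comes from.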
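The abstract mentions a new low-precision scaling law but does not state its functional form. As a hedged illustration only, one common way to express such a law is a Chinchilla-style loss fit in which precision enters through efficiency factors that discount parameters and data; the symbols eff_N(P) and eff_D(P) below are assumed names for this sketch, not necessarily the paper's notation:

```latex
% Illustrative Chinchilla-style form with precision-dependent efficiencies;
% the exact law fitted by Quartet may differ.
L(N, D; P) \;=\; E
  \;+\; \frac{A}{\bigl(\mathrm{eff}_N(P)\, N\bigr)^{\alpha}}
  \;+\; \frac{B}{\bigl(\mathrm{eff}_D(P)\, D\bigr)^{\beta}}
```

Here N is the parameter count, D the number of training tokens, P the bit-width, and the efficiency factors approach 1 at full precision. Under a form like this, a bit-width is "near-optimal" when the throughput gained from lower precision outweighs the loss in effective parameters and data, which is the accuracy-versus-computation trade-off the abstract refers to.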
