Quartet: Native FP4 Training Can Be Optimal for Large Language Models

Abstract

The rapid advancement of large language models (LLMs) has been paralleled by unprecedented increases in computational demands, with training costs for state-of-the-art models doubling every few months. Training models directly in low-precision arithmetic offers a solution by improving both computational throughput and energy efficiency. In particular, NVIDIA's recent Blackwell architecture supports extremely low-precision operations, notably FP4 variants, promising substantial efficiency gains. Yet current algorithms for training LLMs in FP4 precision face significant accuracy degradation and often rely on mixed-precision fallbacks. In this paper, we systematically investigate hardware-supported FP4 training and introduce Quartet, a new approach enabling accurate, end-to-end FP4 training with all the major computations (e.g., in linear layers) performed in low precision. Through extensive evaluations on Llama-type models, we reveal a new low-precision scaling law that quantifies performance trade-offs across varying bit-widths and allows us to identify a "near-optimal" low-precision training technique in terms of accuracy versus computation, called Quartet. We implement Quartet using optimized CUDA kernels tailored for NVIDIA Blackwell GPUs, and show that it achieves state-of-the-art accuracy for FP4 precision, successfully training billion-scale models. Our method demonstrates that fully FP4-based training is a competitive alternative to standard-precision and FP8 training. Our code is available at https://github.com/IST-DASLab/Quartet.
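
To make "FP4 precision" concrete, the sketch below simulates quantizing a group of values to a 4-bit floating-point grid and back. It is not the Quartet method, which the abstract does not spell out. It is plain round-to-nearest "fake quantization" under stated assumptions: the E2M1 value grid (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6), one power-of-two scale per group of 32 values in the style of microscaling formats, and ties resolved toward the smaller magnitude. The function names and the group size default are illustrative choices, not taken from the paper or its code.

```python
import math

# Magnitudes representable in E2M1 FP4 (1 sign, 2 exponent, 1 mantissa bit).
FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
FP4_MAX = FP4_E2M1_GRID[-1]


def round_to_fp4(x):
    """Round an already-scaled value to the nearest E2M1 value.

    Values beyond +/-6 saturate. Ties go to the smaller magnitude, which is
    a simplification (hardware typically uses round-to-nearest-even).
    """
    magnitude = min(abs(x), FP4_MAX)
    nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - magnitude))
    return math.copysign(nearest, x)


def quantize_group(values):
    """Fake-quantize one group: choose a shared scale, round, rescale."""
    absmax = max(abs(v) for v in values)
    if absmax == 0.0:
        return 1.0, [0.0] * len(values)
    # Assumption: smallest power-of-two scale with absmax / scale <= FP4_MAX,
    # so the largest element never saturates.
    scale = 2.0 ** math.ceil(math.log2(absmax / FP4_MAX))
    return scale, [round_to_fp4(v / scale) * scale for v in values]


def quantize(values, group_size=32):
    """Apply group-wise fake quantization to a flat list of floats."""
    out = []
    for start in range(0, len(values), group_size):
        _, dequantized = quantize_group(values[start:start + group_size])
        out.extend(dequantized)
    return out
```

For example, `quantize_group([6.0, 3.0, -1.0, 0.2])` keeps the first three values exactly and rounds 0.2 down to 0, which shows how coarse the grid is near zero relative to the group maximum. Losses of this kind are one plausible source of the accuracy degradation the abstract refers to.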
