COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training
October 25, 2024
Authors: Haocheng Xi, Han Cai, Ligeng Zhu, Yao Lu, Kurt Keutzer, Jianfei Chen, Song Han
cs.AI
Abstract
FP8 training has emerged as a promising method for improving training
efficiency. Existing frameworks accelerate training by applying FP8 computation
to linear layers while leaving optimizer states and activations in higher
precision, which fails to fully optimize memory usage. This paper introduces
COAT (Compressing Optimizer States and Activations for FP8 Training), a novel
FP8 training framework designed to significantly reduce memory footprint when
training large models. COAT addresses current limitations through two key
innovations: (1) Dynamic Range Expansion, which aligns optimizer state
distributions more closely with the FP8 representation range, thereby reducing
quantization error, and (2) Mixed-Granularity Activation Quantization, which
optimizes activation memory using a combination of per-tensor and per-group
quantization strategies. Experiments demonstrate that COAT effectively reduces
end-to-end training memory footprint by 1.54x compared to BF16 while achieving
nearly lossless performance across various tasks, such as Large Language Model
pretraining and fine-tuning, as well as Vision Language Model training. COAT also
achieves a 1.43x end-to-end training speedup compared to BF16, performing on
par with or surpassing TransformerEngine's speedup. COAT enables efficient
full-parameter training of large models on fewer GPUs, and facilitates doubling
the batch size in distributed training settings, providing a practical solution
for scaling large-scale model training. The code is available at
https://github.com/NVlabs/COAT.
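To make the first idea concrete, below is a minimal PyTorch sketch of dynamic range expansion for optimizer states. The function names, the per-group power function sign(x)·|x|^k, and the group size are illustrative assumptions rather than the exact implementation in the COAT repository; the point is only how stretching a narrow distribution toward the E4M3 representation range before quantization reduces quantization error.

```python
import math
import torch

# FP8 E4M3 represents magnitudes up to 448, with the smallest normal magnitude at 2**-6.
E4M3_MAX = 448.0
E4M3_MIN_NORMAL = 2.0 ** -6


def expand_and_quantize(state: torch.Tensor, group_size: int = 128):
    """Quantize an optimizer-state tensor to FP8 after dynamic range expansion.

    Per group of `group_size` elements (numel must divide evenly, for brevity),
    pick an exponent k >= 1 that stretches the group's dynamic range toward the
    E4M3 dynamic range, apply f(x) = sign(x) * |x|**k, then scale so the group
    maximum maps to E4M3_MAX. Returns everything needed to invert the transform.
    """
    x = state.reshape(-1, group_size)
    absx = x.abs().clamp_min(1e-30)
    gmax = absx.amax(dim=1, keepdim=True)
    gmin = absx.amin(dim=1, keepdim=True)

    # Ratio (in log space) of the representable dynamic range to the group's range.
    target_range = math.log(E4M3_MAX / E4M3_MIN_NORMAL)
    group_range = torch.log(gmax / gmin).clamp_min(1e-6)
    k = (target_range / group_range).clamp(min=1.0)  # only expand, never shrink

    expanded = x.sign() * absx.pow(k)
    scale = (expanded.abs().amax(dim=1, keepdim=True) / E4M3_MAX).clamp_min(1e-30)
    q = (expanded / scale).to(torch.float8_e4m3fn)  # FP8 storage (PyTorch >= 2.1)
    return q, scale, k


def dequantize_and_contract(q, scale, k, shape):
    """Invert the transform: dequantize, undo the scale, then apply |x|**(1/k)."""
    x = q.to(torch.float32) * scale
    return (x.sign() * x.abs().pow(1.0 / k)).reshape(shape)
```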
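Similarly, a hedged sketch of mixed-granularity activation quantization: a single per-tensor scale where coarse granularity keeps FP8 GEMMs simple, and per-group scales where finer granularity preserves accuracy for saved activations. Function names and the group size of 128 are assumptions for illustration, not the paper's exact configuration.

```python
import torch

E4M3_MAX = 448.0  # max representable magnitude in FP8 E4M3


def quantize_per_tensor(x: torch.Tensor):
    """One scale for the whole tensor: the coarse granularity that FP8 GEMM
    kernels typically consume for linear-layer inputs and weights."""
    scale = (x.abs().amax() / E4M3_MAX).clamp_min(1e-30)
    return (x / scale).to(torch.float8_e4m3fn), scale


def quantize_per_group(x: torch.Tensor, group_size: int = 128):
    """One scale per contiguous group of elements: finer granularity that keeps
    activations saved for non-linear layers accurate, at the cost of storing
    one extra scale per group. Assumes numel is divisible by group_size."""
    g = x.reshape(-1, group_size)
    scale = (g.abs().amax(dim=1, keepdim=True) / E4M3_MAX).clamp_min(1e-30)
    return (g / scale).to(torch.float8_e4m3fn), scale


def dequantize(q: torch.Tensor, scale: torch.Tensor, shape):
    """Recover a higher-precision tensor from FP8 values and their scale(s)."""
    return (q.to(torch.float32) * scale).reshape(shape)


# Illustrative usage: the input of a linear layer gets a single per-tensor scale,
# while an activation saved for a non-linear layer's backward pass gets per-group scales.
hidden = torch.randn(4, 1024)
q_linear_in, s_linear_in = quantize_per_tensor(hidden)
q_saved_act, s_saved_act = quantize_per_group(hidden, group_size=128)
```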