COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training

Abstract

FP8 training has emerged as a promising way to improve training efficiency. Existing frameworks accelerate training by applying FP8 computation to linear layers but leave optimizer states and activations in higher precision, so memory usage is not fully optimized. This paper introduces COAT (Compressing Optimizer States and Activations for FP8 Training), a novel FP8 training framework designed to significantly reduce the memory footprint of training large models. COAT addresses these limitations through two key innovations: (1) Dynamic Range Expansion, which aligns optimizer state distributions more closely with the FP8 representation range and thereby reduces quantization error, and (2) Mixed-Granularity Activation Quantization, which reduces activation memory by combining per-tensor and per-group quantization strategies. In experiments, COAT reduces the end-to-end training memory footprint by 1.54x compared to BF16 while achieving nearly lossless performance across tasks including Large Language Model pretraining and fine-tuning and Vision Language Model training. COAT also achieves a 1.43x end-to-end training speedup over BF16, on par with or better than TransformerEngine's speedup. COAT enables efficient full-parameter training of large models on fewer GPUs and makes it possible to double the batch size in distributed training settings, providing a practical solution for scaling up large model training. The code is available at https://github.com/NVlabs/COAT.
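The two ideas named in the abstract can be illustrated with a small NumPy sketch. It is not COAT's implementation: `fake_fp8_e4m3` is a hypothetical helper that only simulates E4M3 rounding in floating point, the group size of 16 and the test data are arbitrary, and the power-law transform used for range expansion is an assumed form chosen for illustration, since the abstract does not give the exact function. The sketch compares the mean relative error of per-tensor scaling, per-group scaling, and per-group scaling with the assumed expansion on values whose magnitudes differ widely between groups.

```python
import numpy as np

FP8_MAX = 448.0      # largest finite E4M3 magnitude
FP8_MIN = 2.0 ** -9  # smallest positive (subnormal) E4M3 magnitude


def fake_fp8_e4m3(x):
    """Round to the nearest E4M3-representable value (a simulation, not a real cast)."""
    x = np.clip(np.asarray(x, dtype=np.float64), -FP8_MAX, FP8_MAX)
    _, exp = np.frexp(x)            # x = m * 2**exp with 0.5 <= |m| < 1
    exp = np.maximum(exp, -5)       # values below 2**-6 share the subnormal spacing
    step = np.exp2(exp - 4.0)       # spacing between neighbours with 3 mantissa bits
    return np.round(x / step) * step


def quant_dequant(x, group_size=None):
    """Scale so each group's max maps to FP8_MAX, round, then rescale.

    group_size=None uses a single scale for the whole tensor (per-tensor).
    """
    groups = x.reshape(1, -1) if group_size is None else x.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / FP8_MAX
    scale = np.where(scale == 0.0, 1.0, scale)  # all-zero group: any scale works
    return (fake_fp8_e4m3(groups / scale) * scale).reshape(x.shape)


def expand_quant_dequant(x, group_size):
    """Per-group quantization with an ASSUMED power-law range expansion.

    Assumes every group holds nonzero values with max > min in magnitude.
    """
    groups = x.reshape(-1, group_size)
    mag = np.abs(groups)
    hi = mag.max(axis=1, keepdims=True)
    lo = mag.min(axis=1, keepdims=True)
    # Exponent that stretches the group's range hi/lo onto FP8_MAX/FP8_MIN.
    k = np.log(FP8_MAX / FP8_MIN) / np.log(hi / lo)
    y = np.sign(groups) * (mag / hi) ** k * FP8_MAX
    yq = fake_fp8_e4m3(y)
    # Invert the transform after rounding to recover the original scale.
    restored = np.sign(yq) * (np.abs(yq) / FP8_MAX) ** (1.0 / k) * hi
    return restored.reshape(x.shape)


def mean_rel_error(x, xq):
    """Mean of |x - xq| / |x|; assumes x has no zeros."""
    return float(np.mean(np.abs(x - xq) / np.abs(x)))


def make_example(seed=0):
    """Four groups of 16 positive values whose magnitudes differ by 100x each."""
    rng = np.random.default_rng(seed)
    base = rng.uniform(0.5, 1.0, size=(4, 16))
    return (base * np.array([[1.0], [1e-2], [1e-4], [1e-6]])).reshape(-1)


x = make_example()
print("per-tensor        :", mean_rel_error(x, quant_dequant(x)))
print("per-group         :", mean_rel_error(x, quant_dequant(x, group_size=16)))
print("per-group + expand:", mean_rel_error(x, expand_quant_dequant(x, 16)))
```

On this input, per-tensor scaling flushes the smallest group to zero because a single scale must cover every magnitude, while per-group scaling keeps each group inside the FP8 range. The expansion step helps further when a group's own spread is much narrower than what FP8 can represent.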
