Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origins, System-Wide Impact, and the UFP4 Recipe

Abstract

FP4 training promises substantial reductions in memory and compute cost for LLM pretraining, yet current FP4 hardware paths and recipes, including NVIDIA Blackwell/Rubin-class systems and AMD MI350-series GPUs, remain centered on E2M1 data elements. In this work, we identify a fundamental limitation of that choice: non-uniform formats such as E2M1 inherently suffer from Shrinkage Bias, a systematic negative rounding error caused by the geometric asymmetry of their representable bins. We show that this bias accumulates multiplicatively across layers and is amplified by the Random Hadamard Transform (RHT), which gives a unified explanation for the training instability observed in existing E2M1-based FP4 recipes. In contrast, uniform grids (E1M2/INT4) avoid this grid-geometry error and convert the improved bucket utilization from RHT into higher quantization quality. Based on this finding, we propose UFP4, a uniform 4-bit training recipe that applies RHT to all three training GEMMs while restricting stochastic rounding to dY alone. In long-run pretraining of Dense 1.5B, MoE 7.9B, and MoE 124B models, UFP4 consistently achieves lower loss degradation relative to BF16 than strong E2M1-based baselines, a result supported by scaling-law analysis and ablation studies. Our results suggest that future accelerators should support E1M2/INT4-style uniform 4-bit grids as first-class training primitives alongside E2M1.
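The bin asymmetry mentioned above can be made concrete with a small sketch. It is an illustration under stated assumptions, not the paper's derivation: it assumes round-to-nearest, a locally uniform input density inside each bin, and it ignores scaling, clipping, and the one-sided end bins. `E2M1_GRID` lists the standard E2M1 magnitudes; `UNIFORM_GRID` (unit step) and the helper `bin_bias` are hypothetical names introduced only for this example.

```python
# E2M1 magnitudes (sign handled separately), following the common FP4 E2M1 encoding.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
# Illustrative uniform 8-level magnitude grid (INT4-style); the unit step is an assumption.
UNIFORM_GRID = [float(k) for k in range(8)]


def bin_bias(grid):
    """Mean signed error (quantized - input) for each interior grid point.

    Assumes round-to-nearest and a uniform input density inside each bin.
    The first and last grid points are skipped because their bins are one-sided.
    """
    out = {}
    for lo, g, hi in zip(grid, grid[1:], grid[2:]):
        below = (g - lo) / 2  # how far the bin of g extends below g
        above = (hi - g) / 2  # how far the bin of g extends above g
        # Inputs uniform on [g - below, g + above] have mean g + (above - below) / 2,
        # so the mean of (g - input) is (below - above) / 2.
        out[g] = (below - above) / 2
    return out


if __name__ == "__main__":
    for name, grid in (("E2M1", E2M1_GRID), ("uniform", UNIFORM_GRID)):
        print(name, bin_bias(grid))
```

Under these assumptions the only interior E2M1 points with nonzero bias are 2 and 4, where the grid spacing doubles: their bins reach further upward than downward, so the average input in the bin is rounded down (mean error -0.125 and -0.25 respectively). Every interior point of the uniform grid has zero bias. How this per-bin effect plays out under realistic tensor distributions, scaling, and RHT is not modeled here.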