Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origins, Systemic Impact, and the UFP4 Recipe

Abstract

FP4 training promises substantial reductions in the memory and compute cost of LLM pretraining, yet current FP4 hardware paths and recipes, including NVIDIA Blackwell/Rubin-class systems and AMD MI350-series GPUs, remain centered on E2M1 data elements. In this study, we identify a fundamental limitation of that choice: non-uniform formats such as E2M1 inherently suffer from Shrinkage Bias, a systematic negative rounding error caused by the geometric asymmetry of their representable bins. We show that this bias accumulates multiplicatively across layers and is amplified by the Random Hadamard Transform (RHT), which provides a unified explanation for the training instability observed in existing E2M1-based FP4 recipes. In contrast, uniform grids (E1M2/INT4) avoid this grid-geometry error and more effectively convert the improved bucket utilization from RHT into higher quantization quality. Based on this finding, we propose UFP4, a uniform 4-bit training recipe that applies RHT to all three training GEMMs while restricting stochastic rounding to dY alone. In long-horizon pretraining runs on Dense 1.5B, MoE 7.9B, and MoE 124B models, UFP4 consistently achieves lower loss degradation relative to BF16 than strong E2M1-based baselines, a result supported by scaling-law analysis and ablation studies. Our results suggest that future accelerators should support E1M2/INT4-style uniform 4-bit grids as first-class training primitives alongside E2M1.
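As a rough illustration of the geometric asymmetry described above, the following Python sketch computes, for each interior grid point, the mean signed round-to-nearest error when the input is assumed to be uniformly distributed over that point's rounding bin. This is one simplified reading of the effect and is not taken from the paper's analysis: the function name `bin_bias`, the locally uniform input assumption, and the choice of a uniform 8-level grid rescaled to the same maximum of 6 as a stand-in for an E1M2/INT4-style grid are all assumptions made for this example. The E2M1 magnitudes are the standard set {0, 0.5, 1, 1.5, 2, 3, 4, 6}.

```python
# Positive magnitudes representable in E2M1 (sign bit handled separately).
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

# Assumption: a uniform 8-level grid rescaled to the same maximum (6.0),
# used here as a stand-in for an E1M2/INT4-style grid.
UNIFORM = [6.0 * k / 7 for k in range(8)]


def bin_bias(grid):
    """Mean signed error (q - x) per interior grid point q.

    Assumes x is uniformly distributed over the round-to-nearest bin of q,
    i.e. the interval between the midpoints to q's two neighbours.
    A negative value means inputs in that bin are, on average, rounded down.
    """
    result = []
    for i in range(1, len(grid) - 1):
        lo = (grid[i - 1] + grid[i]) / 2   # lower edge of the bin
        hi = (grid[i] + grid[i + 1]) / 2   # upper edge of the bin
        bin_mean = (lo + hi) / 2           # mean of x under a uniform density
        result.append((grid[i], grid[i] - bin_mean))
    return result


if __name__ == "__main__":
    for name, grid in (("E2M1", E2M1), ("uniform", UNIFORM)):
        print(name)
        for q, bias in bin_bias(grid):
            print(f"  q = {q:.4f}  mean(q - x) = {bias:+.4f}")
```

Under these assumptions, the E2M1 bins around 2 and 4, where the grid spacing doubles, extend further above the grid point than below it, giving mean errors of -0.125 and -0.25, while every interior bin of the uniform grid is symmetric and has zero mean error. The sketch ignores the input distribution, block scaling, the outermost bins, and RHT, all of which the full analysis would need to account for.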