The Curse and Blessing of Mean Bias in FP4-Quantized LLM Training
March 11, 2026
Authors: Hengjie Cao, Zhendong Huang, Mengyi Chen, Yifeng Yang, Fanqi Yu, Ruijun Huang, Fang Dong, Xin Zhang, Jixian Zhou, Anrui Chen, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, Yuan Cheng, Tun Lu, Fan Yang, Li Shang
cs.AI
Abstract
Large language models trained on natural language exhibit pronounced anisotropy: a small number of directions concentrate disproportionate energy, while the remaining dimensions form a broad semantic tail. In low-bit training regimes, this geometry becomes numerically unstable. Because blockwise quantization scales are determined by extreme elementwise magnitudes, dominant directions stretch the dynamic range, compressing long-tail semantic variation into narrow numerical bins. We show that this instability is primarily driven by a coherent rank-one mean bias, which constitutes the dominant component of spectral anisotropy in LLM representations. This mean component emerges systematically across layers and training stages and accounts for the majority of extreme activation magnitudes, making it the principal driver of dynamic-range inflation under low precision. Crucially, because the dominant instability is rank-one, it can be eliminated through a simple source-level mean-subtraction operation. This bias-centric conditioning recovers most of the stability benefits of SVD-based spectral methods while requiring only reduction operations and standard quantization kernels. Empirical results on FP4 (W4A4G4) training show that mean removal substantially narrows the loss gap to BF16 and restores downstream performance, providing a hardware-efficient path to stable low-bit LLM training.
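The mechanism the abstract describes can be illustrated numerically: when a blockwise quantizer sets its scale from the largest-magnitude element, a shared mean offset inflates that scale and crushes the small "semantic tail" variation into a few coarse bins, while subtracting the mean before quantizing (and adding it back after) restores resolution. The sketch below is illustrative only, not the paper's implementation: the `FP4_GRID` values assume an E2M1-style FP4 format, and the block size, bias magnitude, and tail variance are made-up parameters.

```python
import numpy as np

# Representable magnitudes of an E2M1-style FP4 format (an assumption;
# the sign bit is handled separately below).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_blockwise(x, block=32):
    """Simulated blockwise FP4 quantization: each block is scaled so its
    max-magnitude element maps onto the largest FP4 value (6), then every
    element is rounded to the nearest representable FP4 value."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale[scale == 0] = 1.0  # avoid division by zero on all-zero blocks
    q = x / scale
    idx = np.abs(np.abs(q)[..., None] - FP4_GRID).argmin(axis=-1)
    return np.sign(q) * FP4_GRID[idx] * scale

rng = np.random.default_rng(0)
# Anisotropic activations: a large shared mean offset (the rank-one bias)
# plus a small-variance long tail of semantic variation.
bias, tail = 8.0, 0.25
x = bias + tail * rng.standard_normal((64, 32))

# Direct quantization: the bias dominates the per-block scale.
err_raw = np.abs(quantize_fp4_blockwise(x) - x).mean()

# Bias-centric conditioning: subtract the mean, quantize, add it back.
mu = x.mean()
err_centered = np.abs((quantize_fp4_blockwise(x - mu) + mu) - x).mean()

print(err_raw, err_centered)  # mean removal shrinks the quantization error
```

With the bias present, every block's scale is set by values near 8, so the ±0.25 tail spans only a fraction of one FP4 bin; after mean subtraction the same tail occupies the full FP4 range, which is the dynamic-range argument the abstract makes.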