The Curse and Blessing of Mean Bias in FP4-Quantized LLM Training
March 11, 2026
Authors: Hengjie Cao, Zhendong Huang, Mengyi Chen, Yifeng Yang, Fanqi Yu, Ruijun Huang, Fang Dong, Xin Zhang, Jixian Zhou, Anrui Chen, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, Yuan Cheng, Tun Lu, Fan Yang, Li Shang
cs.AI
Abstract
Large language models trained on natural language exhibit pronounced anisotropy: a small number of directions concentrate disproportionate energy, while the remaining dimensions form a broad semantic tail. In low-bit training regimes, this geometry becomes numerically unstable. Because blockwise quantization scales are determined by extreme elementwise magnitudes, dominant directions stretch the dynamic range, compressing long-tail semantic variation into narrow numerical bins. We show that this instability is primarily driven by a coherent rank-one mean bias, which constitutes the dominant component of spectral anisotropy in LLM representations. This mean component emerges systematically across layers and training stages and accounts for the majority of extreme activation magnitudes, making it the principal driver of dynamic-range inflation under low precision. Crucially, because the dominant instability is rank-one, it can be eliminated through a simple source-level mean-subtraction operation. This bias-centric conditioning recovers most of the stability benefits of SVD-based spectral methods while requiring only reduction operations and standard quantization kernels. Empirical results on FP4 (W4A4G4) training show that mean removal substantially narrows the loss gap to BF16 and restores downstream performance, providing a hardware-efficient path to stable low-bit LLM training.
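The mechanism described above can be sketched numerically. The following is a minimal, hypothetical illustration (not the authors' kernel): a toy blockwise FP4 (E2M1) quantizer whose scale is set by the block's absolute maximum, applied to activations with a large shared mean. Subtracting the mean before quantization and adding it back afterwards shrinks the dynamic range the shared scale must cover, so the long-tail residual is resolved more finely. The `fp4_quantize_block` helper and the synthetic data are assumptions for demonstration only.

```python
import numpy as np

# Representable FP4 (E2M1) magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_quantize_block(x):
    """Toy blockwise symmetric FP4 quantization: the scale is fixed by the
    block's absolute maximum, then each element is rounded to the nearest
    representable FP4 magnitude."""
    scale = np.abs(x).max() / FP4_GRID[-1]
    if scale == 0:
        return x.copy()
    mags = np.abs(x) / scale
    idx = np.abs(mags[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(x) * FP4_GRID[idx] * scale

rng = np.random.default_rng(0)
# Long-tail semantic variation plus a coherent mean component shared by
# every coordinate -- the rank-one bias the abstract identifies as the
# driver of dynamic-range inflation.
tail = rng.normal(0.0, 0.1, size=64)
x = tail + 5.0  # the shared mean dominates the block's extreme magnitudes

# Direct quantization: the mean stretches the scale, so the tail is
# compressed into a few coarse bins near the top of the range.
direct = fp4_quantize_block(x)

# Bias-centric conditioning: quantize the centered residual (a cheap
# reduction plus subtraction), then restore the high-precision mean.
mu = x.mean()
conditioned = fp4_quantize_block(x - mu) + mu

err_direct = np.abs(direct - x).mean()
err_cond = np.abs(conditioned - x).mean()
assert err_cond < err_direct  # mean removal preserves the tail better
```

In this toy setting the centered residuals use the full FP4 grid instead of crowding the top bins, which is the intuition behind recovering most of the stability benefit with only reduction operations and standard quantization kernels.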