Revisiting Shrinkage Bias in FP4 Pretraining of Large Language Models: Geometric Origins, Systemic Impact, and the UFP4 Recipe

Abstract

FP4 training promises substantial reductions in memory and computation cost for LLM pretraining, yet current FP4 hardware paths and recipes, including NVIDIA Blackwell/Rubin-class systems and AMD MI350-series GPUs, remain centered on E2M1 data elements. In this study, we identify a fundamental limitation of that choice: non-uniform formats such as E2M1 inherently suffer from Shrinkage Bias, a systematic negative rounding error caused by the geometric asymmetry of their representable bins. We show that this bias accumulates multiplicatively across layers and is amplified by the Random Hadamard Transform (RHT), providing a unified explanation for the training instability observed in existing E2M1-based FP4 recipes. In contrast, uniform grids (E1M2/INT4) bypass this grid-geometry error and better convert the improved bucket utilization from RHT into higher quantization quality. Based on this finding, we propose UFP4, a uniform 4-bit training recipe that applies RHT to all three training GEMMs while restricting stochastic rounding to dY alone. On Dense 1.5B, MoE 7.9B, and MoE 124B long-run pretraining, UFP4 consistently achieves lower BF16-relative loss degradation than strong E2M1-based baselines, supported by scaling-law analysis and ablation studies. Our results suggest that future accelerators should support E1M2/INT4-style uniform 4-bit grids as first-class training primitives alongside E2M1.
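
The grid-geometry argument above can be probed numerically. The following is a minimal sketch, not the paper's method: it assumes the standard E2M1 magnitude set {0, 0.5, 1, 1.5, 2, 3, 4, 6}, uses an evenly spaced 8-level magnitude grid as a stand-in for E1M2/INT4, and replaces block scaling with a single per-tensor absmax scale. The function names `quantize_rtn` and `relative_magnitude_bias` are hypothetical, and measuring the bias as the mean signed change in magnitude is one plausible reading of "shrinkage", chosen here for illustration.

```python
import numpy as np

# Representable magnitudes of E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
# Assumption: an evenly spaced 8-level magnitude grid standing in for E1M2/INT4.
UNIFORM_GRID = np.arange(8, dtype=np.float64)


def quantize_rtn(x, grid):
    """Round-to-nearest onto a sign-symmetric grid.

    Assumption: one per-tensor absmax scale (real recipes use small blocks),
    and x is not all zeros.
    """
    x = np.asarray(x, dtype=np.float64)
    scale = np.max(np.abs(x)) / grid[-1]          # map max |x| to the top level
    mag = np.abs(x) / scale                       # magnitudes in grid units
    idx = np.abs(mag[:, None] - grid[None, :]).argmin(axis=1)
    return np.sign(x) * grid[idx] * scale         # back to the input scale


def relative_magnitude_bias(x, grid):
    """Mean signed change in |x| after quantization, relative to mean |x|.

    Negative values mean the quantized tensor is smaller in magnitude on
    average, i.e. it has shrunk.
    """
    x = np.asarray(x, dtype=np.float64)
    q = quantize_rtn(x, grid)
    return float(np.mean(np.abs(q) - np.abs(x)) / np.mean(np.abs(x)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(1 << 16)              # Gaussian test tensor
    print("E2M1    bias:", relative_magnitude_bias(x, E2M1_GRID))
    print("uniform bias:", relative_magnitude_bias(x, UNIFORM_GRID))
```

The script only prints the two measurements for comparison and asserts no particular outcome. The numbers depend on the input distribution and on the scaling scheme, so they should not be read as reproducing the results reported in the abstract.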