Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning
June 11, 2025
Authors: Jiayi Yuan, Hao Li, Xinheng Ding, Wenya Xie, Yu-Jhe Li, Wentian Zhao, Kun Wan, Jing Shi, Xia Hu, Zirui Liu
cs.AI
Abstract
Large Language Models (LLMs) are now integral across various domains and have
demonstrated impressive performance. Progress, however, rests on the premise
that benchmark scores are both accurate and reproducible. We demonstrate that
the reproducibility of LLM performance is fragile: changing the system
configuration, such as the evaluation batch size, GPU count, or GPU version, can
introduce significant differences in the generated responses. This issue is
especially pronounced in reasoning models, where minor rounding differences in
early tokens can cascade into divergent chains of thought, ultimately affecting
accuracy. For instance, under bfloat16 precision with greedy decoding, a
reasoning model like DeepSeek-R1-Distill-Qwen-7B can exhibit up to a 9% variation
in accuracy and a 9,000-token difference in response length due to differences
in GPU count, GPU type, and evaluation batch size. We trace the root cause of this
variability to the non-associative nature of floating-point arithmetic under
limited numerical precision. This work presents the first systematic
investigation into how numerical precision affects reproducibility in LLM
inference. Through carefully controlled experiments across various hardware,
software, and precision settings, we quantify when and how model outputs
diverge. Our analysis reveals that floating-point precision -- while critical
for reproducibility -- is often neglected in evaluation practices. Inspired by
this, we develop a lightweight inference pipeline, dubbed LayerCast, that
stores weights in 16-bit precision but performs all computations in FP32,
balancing memory efficiency with numerical stability. Code is available at
https://github.com/nanomaoli/llm_reproducibility.
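
The root cause named in the abstract, non-associative floating-point addition under limited precision, can be seen with a few lines of PyTorch. This is a minimal illustration, not code from the paper; the specific values are chosen only to make the bfloat16 rounding visible.

```python
import torch

# bfloat16 keeps only 8 bits of mantissa precision, so the order in which
# additions are performed changes how intermediate results are rounded.
a = torch.tensor(256.0, dtype=torch.bfloat16)
b = torch.tensor(1.0, dtype=torch.bfloat16)
c = torch.tensor(1.0, dtype=torch.bfloat16)

left = (a + b) + c   # 256 + 1 = 257 rounds back to 256, so the sum stays 256
right = a + (b + c)  # 1 + 1 = 2 is exact, and 256 + 2 = 258 is representable
print(left.item(), right.item())  # 256.0 vs. 258.0

# The same effect shows up in large reductions: summing identical values in a
# different order (as happens when batch size or GPU count changes how work is
# partitioned) can give a different bfloat16 result.
def sequential_sum(t: torch.Tensor) -> torch.Tensor:
    acc = torch.tensor(0.0, dtype=torch.bfloat16)
    for v in t:
        acc = acc + v
    return acc

x = torch.randn(2048).to(torch.bfloat16)
print(sequential_sum(x).item(), sequential_sum(x.flip(0)).item())  # typically differ
```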
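The storage-versus-compute split that LayerCast describes (weights kept in 16-bit, all computation in FP32) can be sketched as follows. This is an illustrative sketch of the idea only; the class and helper names (`LayerCastLinear`, `apply_layercast`) are hypothetical and not taken from the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerCastLinear(nn.Module):
    """Linear layer that stores its weight in bfloat16 but computes in FP32.

    Sketch of the 16-bit-storage / FP32-compute idea described in the
    abstract; not the authors' implementation.
    """
    def __init__(self, linear: nn.Linear):
        super().__init__()
        # Keep parameters in 16-bit to preserve the memory savings.
        self.weight = nn.Parameter(linear.weight.detach().to(torch.bfloat16),
                                   requires_grad=False)
        self.bias = None
        if linear.bias is not None:
            self.bias = nn.Parameter(linear.bias.detach().to(torch.bfloat16),
                                     requires_grad=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Upcast weights and activations to FP32 just before the matmul,
        # so the accumulation itself happens at full precision.
        w = self.weight.to(torch.float32)
        b = self.bias.to(torch.float32) if self.bias is not None else None
        return F.linear(x.to(torch.float32), w, b)

def apply_layercast(model: nn.Module) -> nn.Module:
    """Replace every nn.Linear in `model` with a LayerCastLinear wrapper."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, LayerCastLinear(child))
        else:
            apply_layercast(child)
    return model
```

Calling `apply_layercast(model)` would wrap each linear layer so its matrix multiplications accumulate in FP32, while parameter memory stays close to the bfloat16 footprint.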