

Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning

June 11, 2025
Authors: Jiayi Yuan, Hao Li, Xinheng Ding, Wenya Xie, Yu-Jhe Li, Wentian Zhao, Kun Wan, Jing Shi, Xia Hu, Zirui Liu
cs.AI

Abstract

Large Language Models (LLMs) are now integral across various domains and have demonstrated impressive performance. Progress, however, rests on the premise that benchmark scores are both accurate and reproducible. We demonstrate that the reproducibility of LLM performance is fragile: changing system configurations such as evaluation batch size, GPU count, and GPU version can introduce significant differences in the generated responses. This issue is especially pronounced in reasoning models, where minor rounding differences in early tokens can cascade into divergent chains of thought, ultimately affecting accuracy. For instance, under bfloat16 precision with greedy decoding, a reasoning model like DeepSeek-R1-Distill-Qwen-7B can exhibit up to 9% variation in accuracy and up to 9,000 tokens of difference in response length due to differences in GPU count, type, and evaluation batch size. We trace the root cause of this variability to the non-associative nature of floating-point arithmetic under limited numerical precision. This work presents the first systematic investigation into how numerical precision affects reproducibility in LLM inference. Through carefully controlled experiments across various hardware, software, and precision settings, we quantify when and how model outputs diverge. Our analysis reveals that floating-point precision -- while critical for reproducibility -- is often neglected in evaluation practices. Inspired by this, we develop a lightweight inference pipeline, dubbed LayerCast, that stores weights in 16-bit precision but performs all computations in FP32, balancing memory efficiency with numerical stability. Code is available at https://github.com/nanomaoli/llm_reproducibility.
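The non-associativity the abstract identifies as the root cause is easy to demonstrate: with finite-precision floats, the order in which values are summed changes the rounding at each step, so two reduction orders (e.g. from different batch sizes or GPU partitionings) can yield different results. A minimal illustration in plain Python floats:

```python
# Floating-point addition is not associative: the same three values
# summed in different orders round differently at intermediate steps.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # 0.1 + 0.2 rounds to 0.30000000000000004
right = a + (b + c)  # 0.2 + 0.3 rounds to exactly 0.5

print(left)            # 0.6000000000000001
print(right)           # 0.6
print(left == right)   # False
```

The lower the precision (FP32, and more so bfloat16), the larger these per-step rounding errors, which is why the effect compounds across the thousands of reductions in a transformer forward pass.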
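The LayerCast idea described above can be sketched as follows. This is not the authors' implementation (see their repository for that); it is a minimal NumPy illustration of the storage/compute split, with float16 standing in for bfloat16, which NumPy lacks, and a hypothetical `LayerCastLinear` layer name:

```python
# Sketch of the LayerCast storage/compute split: weights live in
# 16-bit precision (memory-efficient), but every matmul upcasts its
# operands to FP32 so the arithmetic itself is full precision.
import numpy as np

class LayerCastLinear:
    def __init__(self, weight_fp32: np.ndarray):
        # Store weights in 16-bit to keep the memory footprint low.
        self.weight = weight_fp32.astype(np.float16)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # Upcast both operands to FP32 just in time for the compute,
        # trading a transient cast for numerical stability.
        return x.astype(np.float32) @ self.weight.astype(np.float32).T

rng = np.random.default_rng(0)
layer = LayerCastLinear(rng.standard_normal((4, 8)).astype(np.float32))
y = layer(rng.standard_normal((2, 8)).astype(np.float16))
print(layer.weight.dtype, y.dtype)  # float16 float32
```

The design point is that only the long-lived tensors (weights) pay the storage cost, while the accumulation-heavy matmuls, where rounding order matters most, run in FP32.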