Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning

Abstract

Large Language Models (LLMs) are now integral across various domains and have demonstrated impressive performance. That progress, however, rests on the premise that benchmark scores are both accurate and reproducible. We demonstrate that the reproducibility of LLM performance is fragile: changes to the system configuration, such as evaluation batch size, GPU count, and GPU version, can introduce significant differences in the generated responses. This issue is especially pronounced in reasoning models, where minor rounding differences in early tokens can cascade into divergent chains of thought, ultimately affecting accuracy. For instance, under bfloat16 precision with greedy decoding, a reasoning model like DeepSeek-R1-Distill-Qwen-7B can exhibit up to 9% variation in accuracy and a difference of 9,000 tokens in response length due to differences in GPU count, GPU type, and evaluation batch size. We trace the root cause of this variability to the non-associative nature of floating-point arithmetic under limited numerical precision. This work presents the first systematic investigation into how numerical precision affects reproducibility in LLM inference. Through carefully controlled experiments across various hardware, software, and precision settings, we quantify when and how model outputs diverge. Our analysis reveals that floating-point precision, while critical for reproducibility, is often neglected in evaluation practices. Inspired by this, we develop a lightweight inference pipeline, dubbed LayerCast, that stores weights in 16-bit precision but performs all computations in FP32, balancing memory efficiency with numerical stability. Code is available at https://github.com/nanomaoli/llm_reproducibility.
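The non-associativity the abstract points to can be illustrated with a small, self-contained simulation. The sketch below is not the authors' LayerCast code. It is a toy model, using only the Python standard library, in which the helper names (to_fp32, to_bf16, accumulate) are hypothetical and bfloat16 is emulated by rounding the float32 encoding to its top 16 bits. It sums the same five bfloat16-representable values in two different orders, standing in for the different reduction orders that a change in batch size or GPU count may produce, and then repeats the sums with FP32 accumulation.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to the nearest bfloat16 value (ties to even). NaN/inf not handled."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits of the float32 encoding. Adding this bias
    # before truncating rounds to nearest, with ties going to the even value.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def accumulate(values, round_fn):
    """Sum values left to right, rounding the running total after every add."""
    acc = 0.0
    for v in values:
        acc = round_fn(acc + v)
    return acc


# Every value is exactly representable in bfloat16 (assumed toy data).
values = [256.0, 1.0, 1.0, 1.0, 1.0]
reordered = list(reversed(values))

# bfloat16 accumulation: the result depends on the order of the additions.
print(accumulate(values, to_bf16))     # 256.0 (each +1 is rounded away)
print(accumulate(reordered, to_bf16))  # 260.0

# FP32 accumulation over the same 16-bit values: both orders agree here.
print(accumulate(values, to_fp32))     # 260.0
print(accumulate(reordered, to_fp32))  # 260.0
```

At magnitude 256, adjacent bfloat16 values are 2 apart, so 257 is a tie that rounds back to 256, and the left-to-right sum never moves. FP32 is not associative either, but its much finer spacing makes order-dependent differences far smaller, which is the intuition behind storing values in 16 bits while computing in FP32.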
