
Defeating the Training-Inference Mismatch via FP16

October 30, 2025
Authors: Penghui Qi, Zichen Liu, Xiangxin Zhou, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin
cs.AI

Abstract

Reinforcement learning (RL) fine-tuning of large language models (LLMs) often suffers from instability due to the numerical mismatch between the training and inference policies. While prior work has attempted to mitigate this issue through algorithmic corrections or engineering alignments, we show that its root cause lies in the floating-point precision itself. The widely adopted BF16, despite its large dynamic range, introduces large rounding errors that break the consistency between training and inference. In this work, we demonstrate that simply reverting to FP16 effectively eliminates this mismatch. The change is simple, fully supported by modern frameworks, requires only a few lines of code, and needs no modification to the model architecture or learning algorithm. Our results suggest that using FP16 uniformly yields more stable optimization, faster convergence, and stronger performance across diverse tasks, algorithms, and frameworks. We hope these findings motivate a broader reconsideration of precision trade-offs in RL fine-tuning.
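As a rough illustration of the precision gap the abstract describes (not code from the paper), the PyTorch sketch below compares the average round-trip rounding error of BF16 and FP16 on random FP32 values. FP16 keeps 10 mantissa bits versus BF16's 7, so within its representable range its rounding error is roughly 8x smaller.

```python
import torch

# Minimal sketch: quantify how coarsely BF16 vs FP16 rounds the same FP32 values.
# BF16 trades mantissa bits (7) for exponent range; FP16 keeps 10 mantissa bits.
x = torch.rand(1_000_000, dtype=torch.float32)

# Round-trip each value through the low-precision format and measure the error.
err_bf16 = (x.to(torch.bfloat16).to(torch.float32) - x).abs().mean().item()
err_fp16 = (x.to(torch.float16).to(torch.float32) - x).abs().mean().item()

print(f"mean BF16 rounding error: {err_bf16:.3e}")
print(f"mean FP16 rounding error: {err_fp16:.3e}")  # roughly 8x smaller
```

In practice, the "few lines of code" the abstract refers to would typically amount to setting a framework's mixed-precision or compute dtype to float16 instead of bfloat16 (FP16 training generally also needs dynamic loss scaling to cope with its narrower dynamic range); the exact configuration depends on the training and inference stack used.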