Defeating the Training-Inference Mismatch via FP16
October 30, 2025
Authors: Penghui Qi, Zichen Liu, Xiangxin Zhou, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin
cs.AI
Abstract
Reinforcement learning (RL) fine-tuning of large language models (LLMs) often
suffers from instability due to the numerical mismatch between the training and
inference policies. While prior work has attempted to mitigate this issue
through algorithmic corrections or engineering alignments, we show that its
root cause lies in the floating point precision itself. The widely adopted
BF16, despite its large dynamic range, introduces large rounding errors that
break the consistency between training and inference. In this work, we
demonstrate that simply reverting to FP16 effectively eliminates this
mismatch. The change is simple, fully supported by modern frameworks with only
a few lines of code changed, and requires no modification to the model
architecture or learning algorithm. Our results suggest that using FP16
uniformly yields more stable optimization, faster convergence, and stronger
performance across diverse tasks, algorithms, and frameworks. We hope these
findings motivate a broader reconsideration of precision trade-offs in RL
fine-tuning.
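
Below is a minimal sketch of what "a few lines of code" could look like in a typical PyTorch + vLLM RL fine-tuning setup: the training policy and the rollout (inference) engine are both pinned to FP16 instead of BF16, and a gradient scaler is added because FP16's narrower dynamic range requires loss scaling. The framework choice, model name, and function structure are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions: PyTorch training loop + vLLM rollout engine;
# not the paper's actual code). The key change is using float16 on both the
# training and inference sides so they share the same rounding behavior.
import torch
from transformers import AutoModelForCausalLM
from vllm import LLM

MODEL_NAME = "my-org/my-policy-model"  # hypothetical model identifier

# Training policy: load weights in FP16 instead of BF16.
policy = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,  # was torch.bfloat16
    device_map="cuda",
)

# FP16 needs gradient (loss) scaling to avoid underflow/overflow in backprop.
scaler = torch.cuda.amp.GradScaler()

def training_step(batch, optimizer):
    # Autocast to FP16 (was bfloat16) for the forward pass.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = policy(**batch).loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()

# Rollout/inference policy: request FP16 from the serving engine as well,
# so the sampling-time numerics match the training-time numerics.
rollout_engine = LLM(model=MODEL_NAME, dtype="float16")  # was dtype="bfloat16"
```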