Defeating the Training-Inference Mismatch via FP16

Abstract

Reinforcement learning (RL) fine-tuning of large language models (LLMs) often suffers from instability caused by a numerical mismatch between the training and inference policies. While prior work has attempted to mitigate this issue through algorithmic corrections or engineering alignments, we show that its root cause lies in floating-point precision itself. The widely adopted BF16, despite its large dynamic range, introduces large rounding errors that break the consistency between training and inference. In this work, we demonstrate that simply reverting to FP16 effectively eliminates this mismatch. The change is simple, is fully supported by modern frameworks, takes only a few lines of code, and requires no modification to the model architecture or learning algorithm. Our results suggest that using FP16 uniformly yields more stable optimization, faster convergence, and stronger performance across diverse tasks, algorithms, and frameworks. We hope these findings motivate a broader reconsideration of precision trade-offs in RL fine-tuning.
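
The rounding-error claim can be illustrated with a small sketch. FP16 stores 10 explicit mantissa bits and BF16 stores 7, so within the same power-of-two interval BF16 values are spaced 8 times further apart. The code below is not taken from the paper: the helper names `round_to_fp16` and `round_to_bf16` are hypothetical, the formats are emulated with the Python standard library rather than a deep learning framework, and the input value 0.7 is an arbitrary stand-in for a token probability.

```python
import struct


def round_to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 binary16 (FP16) value."""
    # The struct format "e" is half precision; packing rounds to nearest even.
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (finite inputs only)."""
    # bfloat16 is the upper 16 bits of the float32 encoding.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, before dropping the low 16 bits.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


p = 0.7  # arbitrary illustrative value, e.g. a token probability
fp16_error = abs(round_to_fp16(p) - p)
bf16_error = abs(round_to_bf16(p) - p)
print(f"FP16: {round_to_fp16(p)!r}  abs error {fp16_error:.2e}")
print(f"BF16: {round_to_bf16(p)!r}  abs error {bf16_error:.2e}")
```

For this input the BF16 result lies further from the original value than the FP16 result. The trade-off runs the other way for range: the largest finite FP16 value is 65504, while BF16 shares the much wider exponent range of float32.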
