On the Entropy Dynamics in Reinforcement Fine-Tuning of Large Language Models
February 3, 2026
Authors: Shumin Wang, Yuexiang Xie, Wenhao Zhang, Yuchang Sun, Yanxi Chen, Yaliang Li, Yanyong Zhang
cs.AI
Abstract
Entropy serves as a critical metric for measuring the diversity of outputs generated by large language models (LLMs), providing valuable insight into their exploration capabilities. While recent studies increasingly focus on monitoring and adjusting entropy to better balance exploration and exploitation in reinforcement fine-tuning (RFT), a principled understanding of entropy dynamics during this process is still lacking. In this paper, we establish a theoretical framework for analyzing entropy dynamics during RFT, beginning with a discriminant expression that quantifies the entropy change induced by a single logit update. This foundation enables the derivation of a first-order expression for the entropy change, which we further extend to the update formula of Group Relative Policy Optimization (GRPO). The corollaries and insights drawn from this analysis inspire the design of entropy control methods and offer a unified lens for interpreting various entropy-based methods in existing studies. We provide empirical evidence supporting our main conclusions and demonstrate the effectiveness of the derived entropy-discriminator clipping methods. This study yields novel insights into RFT training dynamics, providing theoretical support and practical strategies for optimizing the exploration-exploitation balance during LLM fine-tuning.
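The abstract does not reproduce the discriminant itself, but the first-order behavior it refers to can be illustrated with a standard result for softmax policies: to first order, the entropy change under a logit update Δz is ΔH ≈ -Cov_{k~π}(log π_k, Δz_k), so entropy falls when the update raises the logits of tokens the policy already favors. The sketch below is a minimal numerical check of that textbook expansion, not the paper's derivation; the toy vocabulary size and perturbation scale are arbitrary choices made here for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    """Shannon entropy of a probability vector (natural log)."""
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(0)
z = rng.normal(size=32)            # logits over a toy 32-token vocabulary
dz = 1e-3 * rng.normal(size=32)    # a small logit update (e.g., one gradient step)

p = softmax(z)
H = entropy(p)

# First-order prediction: dH ≈ -sum_k p_k (log p_k + H) dz_k
#                            = -Cov_{k~p}(log p_k, dz_k),
# which holds because E_p[log p + H] = 0.
dH_pred = -np.sum(p * (np.log(p) + H) * dz)

dH_exact = entropy(softmax(z + dz)) - H
print(f"first-order dH = {dH_pred:.3e}   exact dH = {dH_exact:.3e}")
```

A per-token quantity of this kind is the sort of signal a discriminator-based clipping rule could threshold, though the paper's exact criterion is not given in the abstract.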
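For context on the GRPO extension mentioned above: GRPO drives its update with group-relative advantages, sampling each prompt G times and standardizing each completion's reward within its group. The snippet below sketches only that normalization (the clipped policy-gradient loss, KL penalty, and the paper's entropy-based modifications are omitted); function and variable names are illustrative.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize each completion's reward
    within its group of G samples drawn for the same prompt."""
    r = np.asarray(rewards, dtype=float)   # shape (G,)
    return (r - r.mean()) / (r.std() + eps)

# Example: four completions for one prompt, scored by some reward model.
print(grpo_advantages([1.0, 0.0, 0.5, 0.0]))
```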