

On the Entropy Dynamics in Reinforcement Fine-Tuning of Large Language Models

February 3, 2026
作者: Shumin Wang, Yuexiang Xie, Wenhao Zhang, Yuchang Sun, Yanxi Chen, Yaliang Li, Yanyong Zhang
cs.AI

Abstract

Entropy serves as a critical metric for measuring the diversity of outputs generated by large language models (LLMs), providing valuable insights into their exploration capabilities. While recent studies increasingly focus on monitoring and adjusting entropy to better balance exploration and exploitation in reinforcement fine-tuning (RFT), the entropy dynamics of this process have yet to be investigated in a principled way. In this paper, we establish a theoretical framework for analyzing entropy dynamics during RFT, beginning with a discriminant expression that quantifies the entropy change under a single logit update. This foundation enables the derivation of a first-order expression for the entropy change, which can be further extended to the update formula of Group Relative Policy Optimization (GRPO). The corollaries and insights drawn from the theoretical analysis inspire the design of entropy control methods, and also offer a unified lens for interpreting various entropy-based methods in existing studies. We provide empirical evidence supporting the main conclusions of our analysis and demonstrate the effectiveness of the derived entropy-discriminator clipping methods. This study yields novel insights into RFT training dynamics, providing theoretical support and practical strategies for optimizing the exploration-exploitation balance during LLM fine-tuning.
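The abstract's first-order expression for entropy change under a logit update can be sketched numerically. The paper's exact discriminant is not reproduced here; the sketch below uses the standard result that for a softmax policy with logits z and probabilities p, the gradient of the entropy H is ∂H/∂z_i = -p_i (log p_i + H), so a small update Δz changes entropy by approximately -Cov_p(log π, Δz). All function names and the toy 8-token vocabulary are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a logit vector.
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    # Shannon entropy H(p) = -sum_i p_i log p_i.
    return float(-np.sum(p * np.log(p)))

def entropy_change_first_order(z, dz):
    """First-order entropy change under a logit update z -> z + dz.

    Uses dH/dz_i = -p_i (log p_i + H), equivalently
    dH ≈ -Cov_p(log p, dz): entropy falls when the update raises
    the logits of tokens that are already likely.
    """
    p = softmax(z)
    H = entropy(p)
    grad = -p * (np.log(p) + H)
    return float(grad @ dz)

rng = np.random.default_rng(0)
z = rng.normal(size=8)           # toy logits over an 8-token vocabulary
dz = 1e-4 * rng.normal(size=8)   # small logit update, e.g. one gradient step

predicted = entropy_change_first_order(z, dz)
actual = entropy(softmax(z + dz)) - entropy(softmax(z))
```

For small updates the first-order prediction closely matches the actual entropy change, and boosting the logit of the most likely token yields a negative predicted change, matching the intuition that reinforcing already-probable tokens reduces exploration.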