EntroPIC: Towards Stable Long-Term Training of LLMs via Entropy Stabilization with Proportional-Integral Control

November 19, 2025
Authors: Kai Yang, Xin Xu, Yangkun Chen, Weijie Liu, Jiafei Lyu, Zichuan Lin, Deheng Ye, Saiyong Yang
cs.AI

Abstract

Long-term training of large language models (LLMs) requires maintaining stable exploration to prevent the model from collapsing into sub-optimal behaviors. Entropy is crucial in this context, as it controls exploration and helps avoid premature convergence to sub-optimal solutions. However, existing reinforcement learning methods struggle to maintain an appropriate level of entropy, as the training process involves a mix of positive and negative samples, each affecting entropy in different ways at different stages of training. To address this, we propose Entropy stabilization via Proportional-Integral Control (EntroPIC), a novel method that adaptively adjusts the influence of positive and negative samples by dynamically tuning their loss coefficients. This approach stabilizes entropy throughout training, ensuring efficient exploration and steady progress. We provide a comprehensive theoretical analysis for both on-policy and off-policy learning settings, demonstrating that EntroPIC is effective at controlling entropy in large-scale LLM training. Experimental results show that our method successfully maintains desired entropy levels, enabling stable and optimal RL training for LLMs.
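
The abstract describes the mechanism only at a high level. For intuition, the minimal Python sketch below shows how a proportional-integral (PI) controller could track a target entropy and turn the tracking error into loss coefficients for positive and negative samples. The class name, gain values, and the specific coefficient scheme are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a PI controller stabilizing policy entropy by
# reweighting positive- and negative-sample losses. Gains and the
# coefficient scheme are assumptions for illustration.

class EntropyPIController:
    def __init__(self, target_entropy: float, kp: float = 0.1, ki: float = 0.01):
        self.target_entropy = target_entropy
        self.kp = kp            # proportional gain (assumed value)
        self.ki = ki            # integral gain (assumed value)
        self.integral = 0.0     # accumulated entropy error

    def update(self, measured_entropy: float) -> float:
        """Return a PI control signal from the current entropy error."""
        error = self.target_entropy - measured_entropy
        self.integral += error
        return self.kp * error + self.ki * self.integral


def combined_policy_loss(pos_loss: float, neg_loss: float, control: float) -> float:
    """Weight the two sample groups by the control signal.

    Assumed convention: negative samples tend to raise entropy (they push
    probability mass off sampled tokens), while positive samples tend to
    lower it. When measured entropy falls below target, the control signal
    turns positive, upweighting negative samples and downweighting
    positive ones.
    """
    alpha_pos = max(0.0, 1.0 - control)   # coefficient on positive samples
    alpha_neg = max(0.0, 1.0 + control)   # coefficient on negative samples
    return alpha_pos * pos_loss + alpha_neg * neg_loss


# Toy usage: entropy drifting below a target of 2.0 nats yields an
# increasingly positive corrective signal.
controller = EntropyPIController(target_entropy=2.0)
for h in [2.2, 2.0, 1.8, 1.6]:
    print(f"entropy={h:.1f}  control={controller.update(h):+.3f}")
```

The integral term is what a fixed entropy-bonus coefficient lacks: it accumulates any persistent gap between measured and target entropy, so a sustained drift keeps growing the correction until the gap closes, rather than settling at a steady-state offset as a purely proportional term would.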