STARE: Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) algorithms such as GRPO have emerged as the dominant post-training paradigm for complex reasoning in LLMs, yet they commonly suffer from policy entropy collapse during training. We conduct a first-order gradient analysis of token-level entropy dynamics under GRPO and identify a token-level credit assignment mismatch: the per-token entropy change decomposes into the product of the trajectory-level advantage and an entropy sensitivity function over the next-token distribution, yielding an advantage-surprisal four-quadrant structure and a near-criticality property. Motivated by this analysis, we propose STARE (Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability), which identifies entropy-critical token subsets via within-batch surprisal quantiles, selectively reweights their effective advantages, and incorporates a target-entropy closed-loop gate for stable entropy regulation. Across model scales from 1.5B to 32B and three task families (Short CoT, Long CoT, and Multi-Turn Tool Use), STARE sustains stable RL training over thousands of steps while keeping policy entropy within the target band. On AIME24 and AIME25, STARE outperforms DAPO and other competitive baselines by 4%-8% in average accuracy, with reflection tokens and response length growing in tandem, indicating a sustained exploration-exploitation balance that further unlocks the potential of RL training. Code is available at https://github.com/hp-luo/STARE.
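To make the decomposition and the reweighting idea concrete, the following is a minimal NumPy sketch, not the released implementation. It assumes a plain softmax policy and a single policy-gradient step on the logits, for which the first-order entropy change is the advantage times a sensitivity term that depends only on the next-token distribution and the sampled token. The abstract does not give the exact sensitivity function, quantile level, or gate, so the function names (`entropy_sensitivity`, `reweight_advantages`), the parameters `q` and `gain`, and the scaling rule are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def entropy(p):
    """Shannon entropy in nats; assumes strictly positive probabilities."""
    return float(-(p * np.log(p)).sum())


def entropy_sensitivity(p, a):
    """First-order entropy change per unit (step size * advantage).

    Assumption: the logits move along d = grad_z log p[a] = e_a - p,
    as in a vanilla policy-gradient step on a softmax policy.
    Uses dH/dz_k = -p_k * (log p_k + H).
    """
    h = entropy(p)
    dh_dz = -p * (np.log(p) + h)
    return float(dh_dz[a] - (p * dh_dz).sum())


def reweight_advantages(adv, surprisal, h_now, h_target, q=0.8, gain=1.0):
    """Hypothetical surprisal-quantile reweighting with a target-entropy gate.

    Tokens whose surprisal (-log p of the sampled token) is at or above the
    within-batch q-quantile are treated as entropy-critical. The gate is the
    clipped entropy error: positive when entropy is below target. Critical
    tokens are scaled by 1 + gate * sign(adv); all others are left unchanged.
    """
    adv = np.asarray(adv, dtype=float)
    surprisal = np.asarray(surprisal, dtype=float)
    critical = surprisal >= np.quantile(surprisal, q)
    gate = float(np.clip(gain * (h_target - h_now), -1.0, 1.0))
    scale = np.where(critical, 1.0 + gate * np.sign(adv), 1.0)
    return adv * scale, critical


if __name__ == "__main__":
    p = np.array([0.9, 0.05, 0.05])
    # Sign of the sensitivity for a likely token versus an unlikely one.
    print(entropy_sensitivity(p, 0), entropy_sensitivity(p, 1))
    print(reweight_advantages([1, 1, -1, -1], [0.1, 3.0, 0.2, 4.0],
                              h_now=0.2, h_target=0.7, q=0.5))
```

Under these assumptions, reinforcing a low-surprisal token lowers entropy while reinforcing a high-surprisal token raises it, and the sensitivity vanishes for a uniform distribution. When entropy falls below the target, this illustrative gate amplifies positive advantages on high-surprisal tokens and attenuates negative ones; the rule actually used by STARE may differ.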