STARE: Surprisal-guided Token-level Advantage Reweighting for Policy Entropy Stability

Abstract

Reinforcement Learning with Verifiable Rewards algorithms such as GRPO have become the dominant post-training paradigm for complex reasoning in LLMs, yet they commonly suffer from policy entropy collapse during training. We conduct a first-order gradient analysis of token-level entropy dynamics under GRPO and identify a token-level credit assignment mismatch: the per-token entropy change decomposes into the product of the trajectory-level advantage and an entropy sensitivity function over the next-token distribution, which yields an advantage-surprisal four-quadrant structure and a near-criticality property. Motivated by this analysis, we propose STARE (Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability), which identifies entropy-critical token subsets via within-batch surprisal quantiles, selectively reweights their effective advantages, and incorporates a target-entropy closed-loop gate for stable entropy regulation. Across model scales from 1.5B to 32B and three task families (Short CoT, Long CoT, and Multi-Turn Tool Use), STARE sustains stable RL training over thousands of steps while keeping policy entropy within the target band. On AIME24 and AIME25, STARE outperforms DAPO and other competitive baselines by 4%-8% in average accuracy, with reflection tokens and response length growing in tandem, indicating a sustained exploration-exploitation balance that further unlocks the potential of RL training. Code is available at https://github.com/hp-luo/STARE.
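
To make the decomposition described in the abstract concrete, the following is a minimal sketch rather than the released implementation. It assumes a tabular softmax policy over logits, for which the first-order entropy change after reinforcing a sampled token factors as (learning rate) x (advantage) x (a sensitivity term that depends only on the next-token distribution and the sampled token). The function names `entropy_sensitivity` and `gated_reweight`, the parameters `q`, `scale`, and `band`, and the specific damping rule are all hypothetical illustrations. The sensitivity function and gate used in the paper may differ.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


def entropy(p):
    """Shannon entropy in nats; assumes every entry of p is > 0."""
    return float(-np.sum(p * np.log(p)))


def entropy_sensitivity(p, a):
    """First-order entropy change per unit (lr * advantage) when token `a`
    is reinforced. ASSUMPTION: tabular softmax policy over logits z.

    The policy-gradient logit step is dz = lr * A * (onehot(a) - p) and
    dH/dz_k = -p_k * (log p_k + H), which gives dH ~= lr * A * g(p, a) with
    g(p, a) = p_a * (s_a - H) - sum_k p_k**2 * (s_k - H),  s = -log p.
    """
    s = -np.log(p)  # per-token surprisal
    h = entropy(p)
    return float(p[a] * (s[a] - h) - np.sum(p**2 * (s - h)))


def gated_reweight(adv, surprisal, cur_entropy, target,
                   band=0.05, q=0.8, scale=0.5):
    """HYPOTHETICAL gate, not the published rule.

    Tokens at or above the within-batch q-quantile of surprisal are treated
    as entropy-critical (a crude proxy for g > 0). Inside the target band
    nothing changes. Below the band, critical tokens with negative advantage
    (which tend to lower entropy) are damped; above the band, critical
    tokens with positive advantage (which tend to raise entropy) are damped.
    """
    adv = np.asarray(adv, dtype=float)
    surprisal = np.asarray(surprisal, dtype=float)
    if abs(cur_entropy - target) <= band:
        return adv.copy()
    critical = surprisal >= np.quantile(surprisal, q)
    if cur_entropy < target:
        damp = critical & (adv < 0)
    else:
        damp = critical & (adv > 0)
    return np.where(damp, scale * adv, adv)
```

Under this assumption, the sign of the product of the advantage and `entropy_sensitivity(p, a)` determines whether a token update raises or lowers entropy, which is one way to read the four-quadrant structure: reinforcing a high-probability (low-surprisal) token tends to lower entropy, while reinforcing a low-probability (high-surprisal) token tends to raise it, and a negative advantage flips both effects.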