STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

February 17, 2026
Authors: Shiqi Liu, Zeyu He, Guojian Zhan, Letian Tao, Zhilong Zheng, Jiang Wu, Yinuo Wang, Yang Guan, Kehua Sheng, Bo Zhang, Keqiang Li, Jingliang Duan, Shengbo Eben Li
cs.AI

Abstract

Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often experience late-stage performance collapse, leading to degraded reasoning quality and unstable training. We derive that the magnitude of token-wise policy gradients in RL is negatively correlated with token probability and local policy entropy. Building on this result, we prove that training instability is driven by a tiny fraction of tokens, approximately 0.01%, which we term spurious tokens. When such tokens appear in correct responses, they contribute little to the reasoning outcome but inherit the full sequence-level reward, leading to abnormally amplified gradient updates. Motivated by this observation, we propose Spurious-Token-Aware Policy Optimization (STAPO) for large-scale model refinement, which selectively masks such updates and renormalizes the loss over valid tokens. Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 7.13% over GRPO, 20-Entropy, and JustRL.
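
As background for the stated link between gradient magnitude and token probability: for a softmax policy, the gradient of log π_θ(y_t | context) with respect to the sampled token's logit is 1 − π_θ(y_t | context), so rare, low-probability tokens receive disproportionately large per-token updates, which is consistent with the abstract's derivation. The sketch below illustrates the masking-and-renormalization step the abstract describes as a simplified REINFORCE-style token loss; it is an illustrative assumption, not the authors' implementation. GRPO's importance ratios and clipping are omitted, all names (stapo_style_loss, spurious_mask, and so on) are hypothetical, and the rule that flags a token as spurious is taken as an input because the abstract specifies it only qualitatively (rare tokens appearing in correct responses).

```python
# Hypothetical sketch only; names and the flagging rule are assumptions,
# not the paper's released implementation.
import torch

def stapo_style_loss(logprobs: torch.Tensor,
                     advantages: torch.Tensor,
                     response_mask: torch.Tensor,
                     spurious_mask: torch.Tensor) -> torch.Tensor:
    """Simplified token-level policy-gradient loss with spurious tokens silenced.

    logprobs:      [batch, seq] log-prob of each sampled token under the current policy
    advantages:    [batch]      sequence-level advantage (e.g. group-normalized reward)
    response_mask: [batch, seq] 1.0 for response tokens, 0.0 for prompt/padding
    spurious_mask: [batch, seq] 1.0 for tokens flagged as spurious by an external rule
    """
    # Keep only response tokens that were NOT flagged as spurious.
    valid = response_mask * (1.0 - spurious_mask)

    # Every token inherits the sequence-level advantage, mirroring GRPO-style
    # objectives in which the whole response shares one reward signal.
    per_token_loss = -logprobs * advantages.unsqueeze(-1)

    # Mask spurious tokens and renormalize over the remaining valid tokens, so
    # silencing a handful of tokens does not shrink the effective learning signal.
    return (per_token_loss * valid).sum() / valid.sum().clamp(min=1.0)
```

Dividing by the number of unmasked tokens, rather than by all response tokens, keeps the loss scale comparable to the unmasked objective even though roughly 0.01% of tokens are silenced.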