STARE: Surprisal-guided Token-level Advantage Reweighting for Policy Entropy Stability

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) algorithms such as GRPO have emerged as the dominant post-training paradigm for complex reasoning in LLMs, yet they commonly suffer from policy entropy collapse during training. We conduct a first-order gradient analysis of token-level entropy dynamics under GRPO and identify a token-level credit assignment mismatch: the per-token entropy variation decomposes into the product of the trajectory-level advantage and an entropy sensitivity function over the next-token distribution, yielding an advantage-surprisal four-quadrant structure and a near-criticality property. Motivated by this analysis, we propose STARE (Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability), which identifies entropy-critical token subsets via batch-internal surprisal quantiles, selectively reweights their effective advantages, and incorporates a target-entropy closed-loop gate for stable entropy regulation. Across model scales from 1.5B to 32B and three task families (Short CoT, Long CoT, and Multi-Turn Tool Use), STARE sustains stable RL training over thousands of steps while keeping policy entropy within the target band. On AIME24 and AIME25, STARE outperforms DAPO and other competitive baselines by 4%-8% in average accuracy, with reflection tokens and response length growing in tandem, indicating a sustained exploration-exploitation balance that further unlocks RL training potential. Code is available at https://github.com/hp-luo/STARE.
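The product structure described in the abstract can be illustrated for a plain softmax policy. The sketch below is not taken from the STARE repository and should be read as an illustration under stated assumptions. `entropy_sensitivity` follows from a standard first-order expansion of the entropy when a policy-gradient step with step size η and advantage A is applied to the logits of a sampled token, giving ΔH ≈ η · A · `entropy_sensitivity(probs, token)`. The functions `reweight_advantages` and `update_gate`, together with the parameters `low_q`, `high_q`, `gain`, and `limit`, are hypothetical stand-ins for the quantile-based reweighting and the target-entropy gate, whose exact form the abstract does not specify.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; assumes strictly positive probabilities."""
    return -sum(p * math.log(p) for p in probs)


def entropy_sensitivity(probs, token):
    """First-order entropy change per unit (step size * advantage) when a
    softmax policy takes a policy-gradient step on the sampled `token`."""
    h = entropy(probs)
    centered = [-math.log(p) - h for p in probs]  # surprisal minus entropy
    baseline = sum(p * p * c for p, c in zip(probs, centered))
    return probs[token] * centered[token] - baseline


def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with q in [0, 1]."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def reweight_advantages(advantages, surprisals, gate, low_q=0.2, high_q=0.8):
    """Hypothetical rule (an assumption, not the paper's): tokens in the
    batch's surprisal tails are scaled by (1 + gate) when the update is
    expected to raise entropy and by (1 - gate) when it is expected to lower
    it, using surprisal as a proxy for the sign of the sensitivity."""
    low = quantile(surprisals, low_q)
    high = quantile(surprisals, high_q)
    out = []
    for adv, s in zip(advantages, surprisals):
        if s >= high:    # high-surprisal token: positive advantage raises entropy
            direction = 1.0 if adv > 0 else -1.0
        elif s <= low:   # low-surprisal token: positive advantage lowers entropy
            direction = -1.0 if adv > 0 else 1.0
        else:            # middle of the distribution: left unchanged
            direction = 0.0
        out.append(adv * max(0.0, 1.0 + gate * direction))
    return out


def update_gate(gate, measured_entropy, target_entropy, gain=0.5, limit=1.0):
    """Hypothetical integral controller: raise the gate when entropy is below
    target, lower it when above, and clip the result to [-limit, limit]."""
    gate += gain * (target_entropy - measured_entropy)
    return max(-limit, min(limit, gate))
```

In this reading, the four quadrants arise from the sign of the advantage combined with the sign of the sensitivity: for the distribution `[0.7, 0.2, 0.1]`, reinforcing the most likely token has negative sensitivity (entropy falls), while reinforcing either unlikely token has positive sensitivity (entropy rises). A training loop would call `update_gate` once per step with the measured batch entropy and pass the result to `reweight_advantages` before computing the policy loss.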