STARE: Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) algorithms such as GRPO have emerged as the dominant post-training paradigm for complex reasoning in LLMs, yet they commonly suffer from policy entropy collapse during training. We conduct a first-order gradient analysis of token-level entropy dynamics under GRPO and identify a token-level credit assignment mismatch: the per-token entropy variation decomposes into the product of the trajectory-level advantage and an entropy sensitivity function over the next-token distribution, yielding an advantage-surprisal four-quadrant structure and a near-criticality property. Motivated by this analysis, we propose STARE (Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability), which identifies entropy-critical token subsets via batch-internal surprisal quantiles, selectively reweights their effective advantages, and incorporates a target-entropy closed-loop gate for stable entropy regulation. Across model scales from 1.5B to 32B and three task families (Short CoT, Long CoT, and Multi-Turn Tool Use), STARE sustains stable RL training over thousands of steps while maintaining policy entropy within the target band. On AIME24 and AIME25, STARE outperforms DAPO and other competitive baselines by 4%-8% in average accuracy, with reflection tokens and response length growing in tandem, indicating a sustained exploration-exploitation balance that further unlocks RL training potential. Code is available at https://github.com/hp-luo/STARE.
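To make the two central ideas of the abstract concrete, the following is a minimal sketch and not the STARE implementation; the linked repository remains the authoritative source. It assumes a plain softmax over free logits, under which one policy-gradient step on a sampled token a changes the entropy H, to first order, by (step size) x (advantage) x S(a; p), where S(a; p) = g_a - sum_k p_k g_k and g_k = p_k (surprisal_k - H). The second half shows one possible reading of "surprisal-quantile selection plus a target-entropy gate". The function names, the nearest-rank quantile, the sign-based gate, and the defaults q, boost, and damp are all hypothetical and do not come from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0.0)

def entropy_sensitivity(p, a):
    """S(a; p): first-order entropy change per unit (step size * advantage)
    for one softmax policy-gradient step on sampled token index `a`.
    Assumes free logits, an illustrative simplification."""
    h = entropy(p)
    # g[k] = dH/dz_k = p_k * (surprisal_k - H)
    g = [pk * (-math.log(pk) - h) if pk > 0.0 else 0.0 for pk in p]
    return g[a] - sum(pk * gk for pk, gk in zip(p, g))

def surprisal_threshold(surprisals, q):
    """Nearest-rank q-quantile of the batch surprisals (hypothetical choice)."""
    ordered = sorted(surprisals)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]

def reweight_advantages(advantages, surprisals, batch_entropy, target_entropy,
                        q=0.8, boost=1.5, damp=0.5):
    """Rescale advantages of high-surprisal tokens. All defaults are
    illustrative. Relies on the assumption that raising the probability of a
    high-surprisal token raises entropy (positive sensitivity)."""
    thresh = surprisal_threshold(surprisals, q)
    # Gate: +1 when entropy is below target (push it up), -1 otherwise.
    gate = 1.0 if batch_entropy < target_entropy else -1.0
    out = []
    for adv, s in zip(advantages, surprisals):
        if s < thresh or adv == 0.0:
            out.append(adv)          # non-critical token: unchanged
        elif adv * gate > 0.0:
            out.append(adv * boost)  # update moves entropy toward the target
        else:
            out.append(adv * damp)   # update moves entropy away from the target
    return out

if __name__ == "__main__":
    p = softmax([2.0, 1.0, 0.0, -1.0])
    # Sign of the sensitivity for the most likely and the least likely token.
    print(entropy_sensitivity(p, 0) < 0, entropy_sensitivity(p, 3) > 0)
    print(reweight_advantages([1.0, -1.0, 1.0, 1.0, -1.0, 0.0],
                              [0.1, 0.2, 0.3, 4.0, 5.0, 6.0],
                              batch_entropy=0.2, target_entropy=0.5, q=0.5))
```

In this toy setting, reinforcing the most likely token lowers entropy while reinforcing a rare token raises it, which is one way to read the advantage-surprisal quadrants: the sign of the entropy change depends jointly on the sign of the advantage and on where the token's surprisal sits relative to the current entropy. A sign-based gate is the simplest closed-loop choice; a proportional controller on the gap between measured and target entropy would be another option under the same assumptions.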