Quantile Advantage Estimation for Entropy-Safe Reasoning

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between entropy collapse and entropy explosion. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO and DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose Quantile Advantage Estimation (QAE), which replaces the mean with a group-wise K-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries (p <= 1 - K, where p is the query's success rate) it reinforces rare successes, while on easy queries (p > 1 - K) it targets remaining failures. Under first-order softmax updates, we prove two-sided entropy safety, giving lower and upper bounds on one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned K, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME 2024/2025 and AMC 2023. These results identify baseline design, rather than token-level heuristics, as the primary mechanism for scaling RLVR.
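The two-regime gate described above can be illustrated with a minimal sketch. It assumes binary rewards (1 for a verified-correct response, 0 otherwise) and applies the gate exactly as stated in the abstract: the baseline is 0 when p <= 1 - K and 1 otherwise. The function names, the value K = 0.4, and the omission of any advantage normalization are illustrative assumptions, not details taken from the paper.

```python
from typing import List


def qae_advantages(rewards: List[float], k: float) -> List[float]:
    """Advantages for one query's response group under a K-quantile baseline.

    Assumes binary rewards in {0, 1}. Hypothetical helper for illustration;
    no normalization is applied.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    if not 0.0 < k < 1.0:
        raise ValueError("k must lie strictly between 0 and 1")

    # Empirical success rate of the query within this group.
    p = sum(rewards) / len(rewards)

    # Two-regime gate: hard queries (p <= 1 - K) get baseline 0,
    # easy queries (p > 1 - K) get baseline 1.
    baseline = 0.0 if p <= 1.0 - k else 1.0
    return [r - baseline for r in rewards]


def mean_baseline_advantages(rewards: List[float]) -> List[float]:
    """Mean-baseline advantages (unnormalized), shown only for contrast."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Hard query: one success among eight responses.
hard_group = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
# Easy query: one failure among eight responses.
easy_group = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]

K = 0.4  # illustrative value only

hard_adv = qae_advantages(hard_group, K)
easy_adv = qae_advantages(easy_group, K)
mean_adv = mean_baseline_advantages(hard_group)
```

In this sketch, only the lone success in the hard group and the lone failure in the easy group receive a nonzero advantage, whereas the mean baseline assigns a nonzero advantage to every response in a mixed group. This is one way to see how a quantile baseline can sparsify credit assignment.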