No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful framework for improving the reasoning abilities of Large Language Models (LLMs). However, current methods such as GRPO rely only on problems where the model's responses to the same input differ in correctness, and ignore those where all responses receive the same reward, the so-called zero-variance prompts. In this work, we argue that such prompts are not useless and can, in fact, provide meaningful feedback for policy optimization. To this end, we introduce RL with Zero-Variance Prompts (RL-ZVP), a novel algorithm that extracts learning signals from zero-variance prompts. RL-ZVP directly rewards correctness and penalizes errors even without contrasting responses, and it modulates this feedback with token-level characteristics to preserve informative, nuanced signals. Across six math reasoning benchmarks, RL-ZVP achieves improvements of up to 8.61 points in accuracy and 7.77 points in pass rate over GRPO, while consistently outperforming other baselines that filter out zero-variance prompts. These results highlight the untapped potential of learning from zero-variance prompts in RLVR.
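To make the notion of a zero-variance prompt concrete, the following minimal sketch shows why a group-normalized advantage of the kind GRPO uses carries no signal when every sampled response receives the same reward. The function `grpo_advantages` follows the standard (reward minus group mean) over group standard deviation form. The function `zero_variance_advantages`, its `alpha` parameter, and the linear scaling by token entropy are assumptions introduced purely for illustration; the abstract does not give the RL-ZVP formula, so this should not be read as the authors' method.

```python
import math
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r_i - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def zero_variance_advantages(rewards, token_entropies, alpha=0.1):
    """HYPOTHETICAL shaping for a zero-variance group (not the paper's formula).

    The sign comes from the shared 0/1 reward (all correct -> positive,
    all wrong -> negative); the magnitude of each token's advantage is
    scaled by that token's entropy. Both choices are illustrative assumptions.
    """
    if len(set(rewards)) != 1:
        raise ValueError("expected a zero-variance group (identical rewards)")
    sign = 1.0 if rewards[0] > 0 else -1.0
    return [sign * alpha * h for h in token_entropies]


if __name__ == "__main__":
    # Mixed group: the normalized advantages are non-zero.
    print(grpo_advantages([1, 0, 1, 0]))
    # Zero-variance group: every advantage is zero, so the gradient vanishes.
    print(grpo_advantages([1, 1, 1, 1]))
    # Hypothetical per-token advantages for an all-wrong group.
    entropies = [token_entropy([0.5, 0.5]), token_entropy([0.9, 0.1])]
    print(zero_variance_advantages([0, 0, 0, 0], entropies))
```

In this sketch the all-correct and all-wrong groups both produce zero advantages under group normalization, which is the situation the abstract describes as being ignored by current methods; the hypothetical shaping function simply shows one way a non-zero, token-dependent signal could be assigned instead.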
