EDGE-GRPO: Entropy-Driven GRPO with Guided Error Correction for Advantage Diversity

Abstract

Large Language Models (LLMs) have made remarkable progress in step-by-step reasoning through reinforcement learning. However, the Group Relative Policy Optimization (GRPO) algorithm, which relies on sparse reward rules, often assigns identical rewards to every response within a group, which leads to the advantage collapse problem. Existing works typically address this challenge from two perspectives: enforcing model reflection to enhance response diversity, and introducing internal feedback to augment the training signal (advantage). In this work, we begin by analyzing the limitations of model reflection and investigating the policy entropy of responses at the fine-grained sample level. Based on our experimental findings, we propose the EDGE-GRPO algorithm, which adopts Entropy-Driven Advantage and Guided Error Correction to effectively mitigate advantage collapse. Extensive experiments on several major reasoning benchmarks demonstrate the effectiveness and superiority of our approach. This work is available at https://github.com/ZhangXJ199/EDGE-GRPO.
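
The advantage collapse described in the abstract can be illustrated with a small sketch. It assumes the commonly used GRPO group normalization, in which each reward has the group mean subtracted and is divided by the group standard deviation. The function name `group_relative_advantages` and the `eps` stabilizer are illustrative assumptions, and the code is not taken from the EDGE-GRPO repository, nor does it implement the paper's entropy-driven advantage.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps) within one group.

    Hypothetical helper for illustration; eps is an assumed stabilizer that
    avoids division by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed sparse rewards gives a usable training signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group whose responses all receive the same reward collapses:
# every advantage is zero, so the group contributes no gradient signal.
collapsed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(collapsed)
```

With a binary correctness reward, the same collapse occurs when every response in the group is correct, since the numerator is zero for each sample in that case as well.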
