Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

Abstract

Recent advances in reinforcement learning (RL) with numerical feedback, such as scalar rewards, have significantly enhanced the complex reasoning capabilities of large language models (LLMs). Despite this success, we identify three key challenges that RL with purely numerical feedback encounters: performance plateaus, limited effectiveness of self-reflection, and persistent failures. We then demonstrate that RL-finetuned models, even after their performance has plateaued, can generate correct refinements on persistently failed problems by leveraging natural language feedback in the form of critiques. Building on this insight, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. Critique-GRPO enables LLMs to learn from initial responses and critique-guided refinements simultaneously while maintaining exploration. Extensive experiments using Qwen2.5-7B-Base and Qwen3-8B-Base show that Critique-GRPO consistently outperforms supervised learning-based and RL-based fine-tuning approaches across eight challenging mathematical, STEM, and general reasoning tasks, improving average pass@1 scores by approximately 4.5% and 5% for the two models, respectively. Notably, Critique-GRPO surpasses a strong baseline that incorporates expert demonstrations within online RL. Further analysis reveals two critical insights about policy exploration: (1) higher entropy does not always guarantee efficient learning from exploration, and (2) longer responses do not necessarily lead to more effective exploration.
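The abstract does not spell out the optimization details, so the following is only a minimal sketch of why pooling critique-guided refinements with initial responses could help on persistently failed problems. It assumes the standard GRPO-style group-relative advantage (each reward standardized against the mean and standard deviation of its group); the function name, the reward values, and the way the two sets of samples are pooled are all hypothetical illustrations, not the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each scalar reward against its group (GRPO-style).

    `eps` is an assumed small constant that avoids division by zero
    when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for one prompt (1.0 = correct, 0.0 = incorrect).
initial_rewards = [0.0, 0.0, 0.0, 0.0]  # a persistently failed problem
refinement_rewards = [1.0, 0.0]         # critique-guided refinements

# All initial rewards are equal, so every advantage is zero:
# numerical feedback alone yields no learning signal here.
adv_initial_only = group_relative_advantages(initial_rewards)

# Pooling the refinements into the same group (an assumption of this
# sketch) gives the one correct refinement a positive advantage.
adv_mixed = group_relative_advantages(initial_rewards + refinement_rewards)
```

In this toy group, the four failed initial responses contribute nothing on their own, whereas the pooled group assigns a positive advantage to the correct refinement and negative advantages to the rest.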
