Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
June 3, 2025
Authors: Xiaoying Zhang, Hao Sun, Yipeng Zhang, Kaituo Feng, Chaochao Lu, Chao Yang, Helen Meng
cs.AI
Abstract
Recent advances in reinforcement learning (RL) with numerical feedback, such
as scalar rewards, have significantly enhanced the complex reasoning
capabilities of large language models (LLMs). Despite this success, we identify
three key challenges encountered by RL with solely numerical feedback:
performance plateaus, limited effectiveness of self-reflection, and persistent
failures. We then demonstrate that RL-finetuned models, even after exhibiting
performance plateaus, can generate correct refinements on persistently failed
problems by leveraging natural language feedback in the form of critiques.
Building on this insight, we propose Critique-GRPO, an online RL framework that
integrates both natural language and numerical feedback for effective policy
optimization. Critique-GRPO enables LLMs to learn from initial responses and
critique-guided refinements simultaneously while maintaining exploration.
Extensive experiments using Qwen2.5-7B-Base and Qwen3-8B-Base show that
Critique-GRPO consistently outperforms supervised learning-based and RL-based
fine-tuning approaches across eight challenging mathematical, STEM, and general
reasoning tasks, improving average pass@1 scores by approximately 4.5% and 5%,
respectively. Notably, Critique-GRPO surpasses a strong baseline that
incorporates expert demonstrations within online RL. Further analysis reveals
two critical insights about policy exploration: (1) higher entropy does not
always guarantee efficient learning from exploration, and (2) longer responses
do not necessarily lead to more effective exploration.
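Since the abstract describes the method only at a high level, the following is a minimal, hedged Python sketch of what one Critique-GRPO-style training step might look like: sample a group of initial responses scored with scalar rewards, obtain natural-language critiques and critique-guided refinements for failed responses, and compute group-relative advantages over the mixed group. All helper names (generate, reward_fn, critique_fn, refine) and the exact grouping and weighting are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of a Critique-GRPO-style data-collection step.
# Function names and the mixing of initial responses with critique-guided
# refinements into one advantage group are assumptions for illustration.

from dataclasses import dataclass
from typing import Callable, List, Tuple
import statistics


@dataclass
class Sample:
    text: str              # model response (initial or critique-guided refinement)
    reward: float           # scalar (numerical) feedback, e.g. 1.0 if correct else 0.0
    from_refinement: bool   # whether this sample came from a critique-guided refinement


def critique_grpo_step(
    prompt: str,
    generate: Callable[[str, int], List[str]],   # policy sampler
    reward_fn: Callable[[str, str], float],      # numerical feedback (e.g. answer checker)
    critique_fn: Callable[[str, str], str],      # natural-language critique of a response
    refine: Callable[[str, str, str], str],      # policy refinement given the critique
    group_size: int = 8,
) -> List[Tuple[str, float]]:
    """Build one mixed group of initial responses and critique-guided
    refinements, then compute group-relative advantages (GRPO-style)."""
    # 1) Sample a group of initial responses and score them with scalar rewards.
    samples = [
        Sample(r, reward_fn(prompt, r), from_refinement=False)
        for r in generate(prompt, group_size)
    ]

    # 2) For failed responses, obtain a natural-language critique and let the
    #    policy produce a critique-guided refinement; score it with the same reward.
    for s in list(samples):
        if s.reward < 1.0:
            critique = critique_fn(prompt, s.text)
            refined = refine(prompt, s.text, critique)
            samples.append(Sample(refined, reward_fn(prompt, refined), True))

    # 3) Group-relative advantage: normalize rewards within the mixed group.
    rewards = [s.reward for s in samples]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(s.text, (s.reward - mean) / std) for s in samples]
```

Each (response, advantage) pair would then feed a clipped, importance-weighted policy-gradient update as in standard GRPO; the distinguishing idea stated in the abstract is that refinements produced under natural-language critiques enter the same update as the initial responses, so the policy learns from both while still exploring on its own.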