Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
June 3, 2025
Authors: Xiaoying Zhang, Hao Sun, Yipeng Zhang, Kaituo Feng, Chaochao Lu, Chao Yang, Helen Meng
cs.AI
Abstract
Recent advances in reinforcement learning (RL) with numerical feedback, such
as scalar rewards, have significantly enhanced the complex reasoning
capabilities of large language models (LLMs). Despite this success, we identify
three key challenges encountered by RL with solely numerical feedback:
performance plateaus, limited effectiveness of self-reflection, and persistent
failures. We then demonstrate that RL-finetuned models, even after exhibiting
performance plateaus, can generate correct refinements on persistently failed
problems by leveraging natural language feedback in the form of critiques.
Building on this insight, we propose Critique-GRPO, an online RL framework that
integrates both natural language and numerical feedback for effective policy
optimization. Critique-GRPO enables LLMs to learn from initial responses and
critique-guided refinements simultaneously while maintaining exploration.
Extensive experiments using Qwen2.5-7B-Base and Qwen3-8B-Base show that
Critique-GRPO consistently outperforms supervised learning-based and RL-based
fine-tuning approaches across eight challenging mathematical, STEM, and general
reasoning tasks, improving average pass@1 scores by approximately 4.5% and 5%,
respectively. Notably, Critique-GRPO surpasses a strong baseline that
incorporates expert demonstrations within online RL. Further analysis reveals
two critical insights about policy exploration: (1) higher entropy does not
always guarantee efficient learning from exploration, and (2) longer responses
do not necessarily lead to more effective exploration.
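The abstract does not give implementation details, but its central idea — letting the policy learn from initial responses and critique-guided refinements simultaneously under a GRPO-style objective — can be illustrated with a minimal, hedged sketch. The function below and the reward values are illustrative assumptions, not the authors' code; they only show how group-relative advantages could be computed over a combined group of initial and refined responses.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Illustrative numerical rewards (e.g., binary answer correctness) for one prompt:
# four initial responses sampled from the policy, and the four critique-guided
# refinements produced after a natural-language critique of each initial response.
initial_rewards = [0.0, 0.0, 1.0, 0.0]
refined_rewards = [1.0, 0.0, 1.0, 1.0]

# Advantages are computed over the combined group, so the policy update draws on
# initial responses and critique-guided refinements at the same time.
advantages = group_relative_advantages(initial_rewards + refined_rewards)
print(advantages)  # positive for correct responses, negative for incorrect ones
```

In a full training loop, these advantages would weight the log-probabilities of each response's tokens in a clipped policy-gradient objective, as in standard GRPO; the critique model, refinement sampling, and reward function are left abstract here.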