Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

Abstract

Recent advances in reinforcement learning (RL) with numerical feedback, such as scalar rewards, have significantly enhanced the complex reasoning capabilities of large language models (LLMs). Despite this success, we identify three key challenges encountered by RL that relies solely on numerical feedback: performance plateaus, limited effectiveness of self-reflection, and persistent failures. We then demonstrate that RL-finetuned models, even after their performance has plateaued, can generate correct refinements on persistently failed problems by leveraging natural language feedback in the form of critiques. Building on this insight, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. Critique-GRPO enables LLMs to learn from initial responses and critique-guided refinements simultaneously while maintaining exploration. Extensive experiments using Qwen2.5-7B-Base and Qwen3-8B-Base show that Critique-GRPO consistently outperforms supervised learning-based and RL-based fine-tuning approaches across eight challenging mathematical, STEM, and general reasoning tasks, improving average pass@1 scores by approximately 4.5% and 5% for the two models, respectively. Notably, Critique-GRPO surpasses a strong baseline that incorporates expert demonstrations within online RL. Further analysis reveals two critical insights about policy exploration: (1) higher entropy does not always guarantee efficient learning from exploration, and (2) longer responses do not necessarily lead to more effective exploration.
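As a rough illustration of how numerical feedback on initial responses and on critique-guided refinements could feed a single policy update, the sketch below computes standard GRPO-style group-relative advantages (each reward standardized against its group's mean and standard deviation) over a mixed group. The reward values, the function name `group_relative_advantages`, and the choice to score initial responses and refinements in one shared group are assumptions made for illustration; the abstract does not specify the actual objective used by Critique-GRPO.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each scalar reward against its group's mean and std.

    This is the generic GRPO-style normalization, not necessarily the
    exact formulation used by Critique-GRPO.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scalar rewards for one prompt (1.0 = correct, 0.0 = incorrect).
initial_rewards = [0.0, 0.0, 0.0, 1.0]  # sampled initial responses
refinement_rewards = [1.0, 0.0]         # critique-guided refinements (assumed)

# Assumption: both kinds of response are scored together in one group.
mixed_group = initial_rewards + refinement_rewards
advantages = group_relative_advantages(mixed_group)

for reward, advantage in zip(mixed_group, advantages):
    print(f"reward={reward:.1f}  advantage={advantage:+.3f}")
```

In this sketch, a group whose rewards are all identical yields zero advantages and therefore no learning signal. Adding a correct refinement to an otherwise all-incorrect group restores a nonzero signal, which is one way to read the role of refinements on persistently failed problems, though the paper's actual mechanism may differ.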
