iGRPO: Self-Feedback-Driven LLM Reasoning
February 9, 2026
Authors: Ali Hatamizadeh, Shrimai Prabhumoye, Igor Gitman, Ximing Lu, Seungju Han, Wei Ping, Yejin Choi, Jan Kautz
cs.AI
Abstract
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. Reinforcement Learning (RL) is a framework for aligning these models with task-specific rewards, improving overall quality and reliability. Group Relative Policy Optimization (GRPO) is an efficient, value-function-free alternative to Proximal Policy Optimization (PPO) that leverages group-relative reward normalization. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. In Stage 1, iGRPO samples multiple exploratory drafts and selects the highest-reward draft using the same scalar reward signal used for optimization. In Stage 2, it appends this best draft to the original prompt and applies a GRPO-style update on draft-conditioned refinements, training the policy to improve beyond its strongest prior attempt. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models (e.g., Nemotron-H-8B-Base-8K and DeepSeek-R1 Distilled), validating its effectiveness on diverse reasoning benchmarks. Moreover, applying iGRPO to OpenReasoning-Nemotron-7B trained on AceReason-Math achieves new state-of-the-art results of 85.62% and 79.64% on AIME24 and AIME25, respectively. Ablations further show that the refinement wrapper generalizes beyond GRPO variants, benefits from a generative judge, and alters learning dynamics by delaying entropy collapse. These results underscore the potential of iterative, self-feedback-based RL for advancing verifiable mathematical reasoning.
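To make the two-stage rollout concrete, the sketch below shows how Stage 1 draft selection and Stage 2 draft-conditioned refinement could be wired together with GRPO-style group-relative normalization. It is a minimal illustration based only on the abstract, not the authors' implementation: the `policy.generate` sampler, the `reward` verifier, the draft-conditioning prompt template, and the sample counts are all assumptions.

```python
# Minimal sketch of an iGRPO-style two-stage rollout (illustrative assumptions:
# a generic `policy.generate(prompt, n)` sampler returning n completions, and a
# scalar `reward(prompt, completion)` verifier, e.g. 1.0 for a verified-correct
# final answer and 0.0 otherwise).

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: each rollout's reward relative to its group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

def igrpo_rollout(policy, prompt, reward, n_drafts=4, group_size=8):
    # Stage 1: sample exploratory drafts and keep the highest-reward one,
    # scored with the same scalar reward signal used for optimization.
    drafts = policy.generate(prompt, n=n_drafts)
    best_draft = max(drafts, key=lambda d: reward(prompt, d))

    # Stage 2: condition on the best draft and sample a group of refinements.
    # The conditioning template here is a placeholder, not the paper's format.
    conditioned = f"{prompt}\n\nPrevious best attempt:\n{best_draft}\n\nImprove on it:"
    refinements = policy.generate(conditioned, n=group_size)
    rewards = [reward(prompt, r) for r in refinements]

    # The GRPO-style policy update then uses these group-relative advantages
    # computed over the draft-conditioned refinements.
    return conditioned, refinements, group_relative_advantages(rewards)
```

Under the matched-rollout comparison described in the abstract, the total samples per prompt (here `n_drafts + group_size`) would presumably be held equal to the group size used by plain GRPO, so any gains come from the self-conditioning rather than extra compute.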