
Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends

September 29, 2025
Authors: Chaorui Yao, Yanxi Chen, Yuchang Sun, Yushuo Chen, Wenhao Zhang, Xuchen Pan, Yaliang Li, Bolin Ding
cs.AI

Abstract

Off-policy reinforcement learning (RL) for large language models (LLMs) is attracting growing interest, driven by practical constraints in real-world applications, the complexity of LLM-RL infrastructure, and the need for further innovation in RL methodologies. While classic REINFORCE and its modern variants like Group Relative Policy Optimization (GRPO) are typically regarded as on-policy algorithms with limited tolerance for off-policyness, we present in this work a first-principles derivation for group-relative REINFORCE without assuming a specific training data distribution, showing that it admits a native off-policy interpretation. This perspective yields two general principles for adapting REINFORCE to off-policy settings: regularizing policy updates, and actively shaping the data distribution. Our analysis demystifies some myths about the roles of importance sampling and clipping in GRPO, unifies and reinterprets two recent algorithms -- Online Policy Mirror Descent (OPMD) and Asymmetric REINFORCE (AsymRE) -- as regularized forms of the REINFORCE loss, and offers theoretical justification for seemingly heuristic data-weighting strategies. Our findings lead to actionable insights that are validated with extensive empirical studies, and open up new opportunities for principled algorithm design in off-policy RL for LLMs. Source code for this work is available at https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k.
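As background for the terms used in the abstract, the following is a minimal PyTorch sketch of the two ingredients of group-relative REINFORCE: GRPO-style group-normalized advantages and a plain REINFORCE loss with no importance-sampling ratio and no clipping. The function names and toy tensors are illustrative assumptions, not taken from the paper or its released code, and the sketch works with sequence-level log-probabilities for brevity rather than token-level terms.

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    # rewards: (G,) scalar rewards for G responses sampled for the same prompt.
    # GRPO-style group normalization: subtract the group mean and divide by the
    # group standard deviation.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def reinforce_loss(logprobs, advantages):
    # logprobs: (G,) sum of token log-probabilities of each sampled response
    # under the current policy; advantages: (G,) group-relative advantages.
    # Plain group-relative REINFORCE: no importance ratio, no clipping.
    return -(advantages.detach() * logprobs).mean()

# Toy usage: 4 sampled responses for one prompt (hypothetical values).
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
logprobs = torch.randn(4, requires_grad=True)  # stand-in for sequence log-probs
loss = reinforce_loss(logprobs, group_relative_advantages(rewards))
loss.backward()
```

GRPO's usual surrogate additionally multiplies each term by a clipped ratio between the current and the behavior (sampling) policy; the paper's argument concerns how the plain loss above behaves when the log-probabilities and the sampled responses come from different distributions.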