
Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends

September 29, 2025
Authors: Chaorui Yao, Yanxi Chen, Yuchang Sun, Yushuo Chen, Wenhao Zhang, Xuchen Pan, Yaliang Li, Bolin Ding
cs.AI

Abstract

Off-policy reinforcement learning (RL) for large language models (LLMs) is attracting growing interest, driven by practical constraints in real-world applications, the complexity of LLM-RL infrastructure, and the need for further innovation in RL methodologies. While classic REINFORCE and its modern variants like Group Relative Policy Optimization (GRPO) are typically regarded as on-policy algorithms with limited tolerance for off-policyness, we present in this work a first-principles derivation for group-relative REINFORCE without assuming a specific training data distribution, showing that it admits a native off-policy interpretation. This perspective yields two general principles for adapting REINFORCE to off-policy settings: regularizing policy updates, and actively shaping the data distribution. Our analysis demystifies some myths about the roles of importance sampling and clipping in GRPO, unifies and reinterprets two recent algorithms -- Online Policy Mirror Descent (OPMD) and Asymmetric REINFORCE (AsymRE) -- as regularized forms of the REINFORCE loss, and offers theoretical justification for seemingly heuristic data-weighting strategies. Our findings lead to actionable insights that are validated with extensive empirical studies, and open up new opportunities for principled algorithm design in off-policy RL for LLMs. Source code for this work is available at https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k.
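To make the "group-relative REINFORCE" formulation referenced in the abstract concrete, here is a minimal sketch of the standard group-normalized advantage (as published for GRPO) combined with a plain REINFORCE loss that applies no importance-sampling ratio, matching the native off-policy reading the abstract describes. This is an illustrative assumption-laden example, not code from the linked Trinity-RFT repository; the function names, the `eps` constant, and the dummy tensors are hypothetical.

```python
# Minimal sketch (not the paper's implementation): group-relative REINFORCE
# loss on responses sampled from a behavior policy that may differ from the
# policy being trained (the off-policy setting discussed in the abstract).
import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within a group of responses to the same prompt,
    following GRPO's published form: A_i = (r_i - mean(r)) / (std(r) + eps)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def group_relative_reinforce_loss(logprobs: torch.Tensor,
                                  rewards: torch.Tensor) -> torch.Tensor:
    """Plain REINFORCE loss with group-relative advantages.

    `logprobs` are the current policy's sequence log-probabilities of the
    sampled responses; no importance-sampling ratio or clipping is applied,
    which is what makes the loss directly usable on off-policy samples in
    the sense sketched by the abstract.
    """
    advantages = group_relative_advantages(rewards).detach()
    return -(advantages * logprobs).mean()


# Example usage with dummy values: one prompt, a group of 4 sampled responses.
logprobs = torch.tensor([-12.3, -10.1, -15.7, -11.0], requires_grad=True)
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
loss = group_relative_reinforce_loss(logprobs, rewards)
loss.backward()
```

Regularizing this update (e.g., toward a reference or behavior policy) and reweighting the sampled data are the two levers the paper identifies for controlled off-policy behavior; the sketch above only shows the unregularized base loss.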