
Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends

September 29, 2025
Authors: Chaorui Yao, Yanxi Chen, Yuchang Sun, Yushuo Chen, Wenhao Zhang, Xuchen Pan, Yaliang Li, Bolin Ding
cs.AI

Abstract

Off-policy reinforcement learning (RL) for large language models (LLMs) is attracting growing interest, driven by practical constraints in real-world applications, the complexity of LLM-RL infrastructure, and the need for further innovation in RL methodologies. While classic REINFORCE and its modern variants like Group Relative Policy Optimization (GRPO) are typically regarded as on-policy algorithms with limited tolerance for off-policyness, we present in this work a first-principles derivation for group-relative REINFORCE without assuming a specific training data distribution, showing that it admits a native off-policy interpretation. This perspective yields two general principles for adapting REINFORCE to off-policy settings: regularizing policy updates, and actively shaping the data distribution. Our analysis demystifies some myths about the roles of importance sampling and clipping in GRPO, unifies and reinterprets two recent algorithms -- Online Policy Mirror Descent (OPMD) and Asymmetric REINFORCE (AsymRE) -- as regularized forms of the REINFORCE loss, and offers theoretical justification for seemingly heuristic data-weighting strategies. Our findings lead to actionable insights that are validated with extensive empirical studies, and open up new opportunities for principled algorithm design in off-policy RL for LLMs. Source code for this work is available at https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k.
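To make the "group-relative REINFORCE" formulation referenced in the abstract concrete, here is a minimal sketch of the standard group-normalized advantage (as published for GRPO) combined with a plain REINFORCE loss that applies no importance-sampling ratio, matching the native off-policy reading the abstract describes. This is an illustrative assumption-laden example, not code from the linked Trinity-RFT repository; the function names, the `eps` constant, and the dummy tensors are hypothetical.

```python
# Minimal sketch (not the paper's implementation): group-relative REINFORCE
# loss on responses sampled from a behavior policy that may differ from the
# policy being trained (the off-policy setting discussed in the abstract).
import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within a group of responses to the same prompt,
    following GRPO's published form: A_i = (r_i - mean(r)) / (std(r) + eps)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def group_relative_reinforce_loss(logprobs: torch.Tensor,
                                  rewards: torch.Tensor) -> torch.Tensor:
    """Plain REINFORCE loss with group-relative advantages.

    `logprobs` are the current policy's sequence log-probabilities of the
    sampled responses; no importance-sampling ratio or clipping is applied,
    which is what makes the loss directly usable on off-policy samples in
    the sense sketched by the abstract.
    """
    advantages = group_relative_advantages(rewards).detach()
    return -(advantages * logprobs).mean()


# Example usage with dummy values: one prompt, a group of 4 sampled responses.
logprobs = torch.tensor([-12.3, -10.1, -15.7, -11.0], requires_grad=True)
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
loss = group_relative_reinforce_loss(logprobs, rewards)
loss.backward()
```

Regularizing this update (e.g., toward a reference or behavior policy) and reweighting the sampled data are the two levers the paper identifies for controlled off-policy behavior; the sketch above only shows the unregularized base loss.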