Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends

Abstract

Off-policy reinforcement learning (RL) for large language models (LLMs) is attracting growing interest, driven by practical constraints in real-world applications, the complexity of LLM-RL infrastructure, and the need for further innovation in RL methodologies. While classic REINFORCE and its modern variants such as Group Relative Policy Optimization (GRPO) are typically regarded as on-policy algorithms with limited tolerance for off-policyness, we present a first-principles derivation of group-relative REINFORCE that does not assume a specific training data distribution, showing that it admits a native off-policy interpretation. This perspective yields two general principles for adapting REINFORCE to off-policy settings: regularizing policy updates, and actively shaping the data distribution. Our analysis demystifies some myths about the roles of importance sampling and clipping in GRPO, unifies and reinterprets two recent algorithms, Online Policy Mirror Descent (OPMD) and Asymmetric REINFORCE (AsymRE), as regularized forms of the REINFORCE loss, and offers theoretical justification for seemingly heuristic data-weighting strategies. Our findings lead to actionable insights that are validated with extensive empirical studies, and open up new opportunities for principled algorithm design in off-policy RL for LLMs. Source code for this work is available at https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k.

