
Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR

September 2, 2025
Authors: Jiaming Li, Longze Chen, Ze Gong, Yukun Chen, Lu Wang, Wanwei He, Run Luo, Min Yang
cs.AI

Abstract

Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have empowered large language models (LLMs) to tackle challenging reasoning tasks such as mathematics and programming. RLVR leverages verifiable outcome rewards to guide policy optimization, enabling LLMs to progressively improve output quality in a grounded and reliable manner. Despite its promise, the RLVR paradigm poses significant challenges, as existing methods often suffer from sparse reward signals and unstable policy gradient updates, particularly in RL-based approaches. To address these challenges, we propose PACS, a novel RLVR framework that achieves imPlicit Actor Critic coupling via a Supervised learning framework. By treating the outcome reward as a predictable label, we reformulate the RLVR problem into a supervised learning task over a score function parameterized by the policy model and optimized using cross-entropy loss. A detailed gradient analysis shows that this supervised formulation inherently recovers the classical policy gradient update while implicitly coupling the actor and critic roles, yielding more stable and efficient training. Benchmarking on challenging mathematical reasoning tasks, PACS outperforms strong RLVR baselines such as PPO and GRPO, achieving superior reasoning performance. For instance, PACS achieves 59.78% at pass@256 on AIME 2025, an improvement of 13.32 and 14.36 percentage points over PPO and GRPO, respectively. This simple yet powerful framework offers a promising avenue for post-training LLMs with verifiable rewards. Our code and data are available as open source at https://github.com/ritzz-ai/PACS.
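
To make the reformulation concrete, the following is a minimal, hypothetical PyTorch sketch of the kind of objective the abstract describes: the verifiable outcome reward is treated as a binary label, and a score parameterized by the policy model is fit with a cross-entropy loss. The specific choice of score here (the sampled response's sequence log-probability under the policy) and the name `pacs_style_loss` are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch of RLVR-as-supervised-learning with a cross-entropy loss.
# Assumption (not specified by the abstract): the policy-parameterized score of a
# sampled response is its sequence log-probability under the policy model.
import torch
import torch.nn.functional as F


def pacs_style_loss(seq_logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between a policy-parameterized score and the outcome label.

    seq_logprobs: (batch,) log pi_theta(y | x) for each sampled response,
                  used here as the score/logit (an assumption for illustration).
    rewards:      (batch,) verifiable outcome rewards in {0, 1}.
    """
    # Gradient of this loss w.r.t. theta is (sigmoid(score) - reward) * grad log pi,
    # i.e. a policy-gradient-like update whose weighting term plays a critic-like role.
    return F.binary_cross_entropy_with_logits(seq_logprobs, rewards)


# Toy usage with dummy values; in real training, seq_logprobs would be obtained by
# summing the token log-probabilities of each sampled response under the LLM policy.
if __name__ == "__main__":
    scores = torch.tensor([-0.3, -1.2, -0.1], requires_grad=True)
    labels = torch.tensor([1.0, 0.0, 1.0])
    loss = pacs_style_loss(scores, labels)
    loss.backward()
    print(loss.item(), scores.grad)
```

In this sketch the supervised target is the outcome reward itself, so no separate value network is trained; the weighting inside the gradient acts as the implicit critic, which is the coupling the abstract refers to.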