Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR
September 2, 2025
Authors: Jiaming Li, Longze Chen, Ze Gong, Yukun Chen, Lu Wang, Wanwei He, Run Luo, Min Yang
cs.AI
Abstract
Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have
empowered large language models (LLMs) to tackle challenging reasoning tasks
such as mathematics and programming. RLVR leverages verifiable outcome rewards
to guide policy optimization, enabling LLMs to progressively improve output
quality in a grounded and reliable manner. Despite its promise, the RLVR
paradigm still faces significant challenges: existing methods often suffer from
sparse reward signals and unstable policy gradient updates, particularly in
RL-based approaches. To address these challenges, we propose PACS, a
novel RLVR framework that achieves imPlicit Actor
Critic coupling via a Supervised learning framework. By
treating the outcome reward as a predictable label, we reformulate the RLVR
problem into a supervised learning task over a score function parameterized by
the policy model and optimized using cross-entropy loss. A detailed gradient
analysis shows that this supervised formulation inherently recovers the
classical policy gradient update while implicitly coupling actor and critic
roles, yielding more stable and efficient training. Benchmarking on challenging
mathematical reasoning tasks, PACS outperforms strong RLVR baselines, such as
PPO and GRPO, achieving superior reasoning performance. For instance, PACS
achieves 59.78% at pass@256 on AIME 2025, representing improvements of 13.32
and 14.36 percentage points over PPO and GRPO, respectively. This simple yet
powerful framework offers a promising avenue for LLM post-training with
verifiable rewards. Our code and
data are available as open source at https://github.com/ritzz-ai/PACS.
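As a rough illustration of the reformulation described in the abstract (treating the verifiable outcome reward as a predictable label and fitting a policy-parameterized score function with a cross-entropy loss), the following PyTorch-style sketch may help. It is an assumption on our part, not the authors' released implementation: the function name pacs_style_loss, the particular score function (mean response-token log-probability under the policy), and the omission of padding handling are simplifications made here for clarity.

    import torch
    import torch.nn.functional as F

    def pacs_style_loss(policy_model, prompt_ids, response_ids, outcome_reward):
        # Hedged sketch: the verifiable outcome reward (0 or 1) is used as a
        # supervised label, and a score parameterized by the policy model is
        # trained with binary cross-entropy. Assumes a Hugging Face-style model
        # whose forward pass returns an object with a .logits tensor, and no
        # padding in the batch.
        input_ids = torch.cat([prompt_ids, response_ids], dim=-1)
        logits = policy_model(input_ids).logits                # (B, T, V)

        # Log-probabilities of the generated response tokens under the policy:
        # logits at position t predict token t+1, so shift by one.
        resp_logits = logits[:, prompt_ids.size(-1) - 1:-1, :]  # (B, Tr, V)
        logprobs = torch.log_softmax(resp_logits, dim=-1)
        token_logprobs = logprobs.gather(
            -1, response_ids.unsqueeze(-1)
        ).squeeze(-1)                                            # (B, Tr)

        # Hypothetical score: mean log-probability of the response, used as a
        # logit for predicting the outcome label.
        score = token_logprobs.mean(dim=-1)                      # (B,)

        # The outcome reward in {0, 1} acts as the supervised target.
        label = outcome_reward.float()
        return F.binary_cross_entropy_with_logits(score, label)

Under this kind of formulation, the gradient of the cross-entropy loss with respect to the policy parameters passes through the response log-probabilities, which is consistent with the abstract's claim that the supervised objective recovers a policy-gradient-like update while coupling actor and critic roles in a single model.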