

Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR

September 2, 2025
Authors: Jiaming Li, Longze Chen, Ze Gong, Yukun Chen, Lu Wang, Wanwei He, Run Luo, Min Yang
cs.AI

Abstract

Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have empowered large language models (LLMs) to tackle challenging reasoning tasks such as mathematics and programming. RLVR leverages verifiable outcome rewards to guide policy optimization, enabling LLMs to progressively improve output quality in a grounded and reliable manner. Despite its promise, the RLVR paradigm poses significant challenges, as existing methods often suffer from sparse reward signals and unstable policy gradient updates, particularly in RL-based approaches. To address these challenges, we propose PACS, a novel RLVR framework that achieves imPlicit Actor Critic coupling via a Supervised learning framework. By treating the outcome reward as a predictable label, we reformulate the RLVR problem into a supervised learning task over a score function parameterized by the policy model and optimized using cross-entropy loss. A detailed gradient analysis shows that this supervised formulation inherently recovers the classical policy gradient update while implicitly coupling the actor and critic roles, yielding more stable and efficient training. Benchmarking on challenging mathematical reasoning tasks, PACS outperforms strong RLVR baselines such as PPO and GRPO, achieving superior reasoning performance. For instance, PACS achieves 59.78% pass@256 on AIME 2025, an improvement of 13.32 and 14.36 percentage points over PPO and GRPO, respectively. This simple yet powerful framework offers a promising avenue for post-training LLMs with verifiable rewards. Our code and data are available as open source at https://github.com/ritzz-ai/PACS.
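To make the reformulation concrete, the following is a minimal PyTorch sketch of the idea described in the abstract: the verifiable outcome reward is treated as a binary label, and a score derived from the policy model is fit with cross-entropy. The specific score parameterization (a scaled log-ratio against a frozen reference model), the scale `beta`, and the name `pacs_style_loss` are illustrative assumptions, not the authors' implementation; see the repository above for the actual method.

```python
# Hypothetical sketch of the supervised reformulation sketched in the abstract:
# a verifiable outcome reward (0/1) is treated as a label, and a score
# parameterized by the policy is optimized with binary cross-entropy.
import torch
import torch.nn.functional as F

def pacs_style_loss(policy_logprobs, ref_logprobs, rewards, beta=1.0):
    """
    policy_logprobs: summed token log-probs of each sampled response under the
                     current policy, shape (batch,)
    ref_logprobs:    the same quantity under a frozen reference model, shape (batch,)
    rewards:         verifiable outcome labels in {0, 1}, shape (batch,)
    """
    # Assumed score: scaled log-ratio between policy and reference likelihoods.
    scores = beta * (policy_logprobs - ref_logprobs)
    # Supervised objective: cross-entropy between sigmoid(score) and the reward label.
    return F.binary_cross_entropy_with_logits(scores, rewards.float())

# Toy usage with dummy tensors; real inputs would come from sampled LLM
# responses and a rule-based verifier (e.g., exact match on the final answer).
policy_lp = torch.tensor([-12.3, -40.1, -25.7], requires_grad=True)
ref_lp = torch.tensor([-13.0, -35.2, -26.0])
labels = torch.tensor([1.0, 0.0, 1.0])
loss = pacs_style_loss(policy_lp, ref_lp, labels)
loss.backward()  # gradients flow back into the policy's log-probabilities
```

Under this assumed parameterization, the gradient of the cross-entropy loss with respect to the score is sigmoid(score) minus the reward label, which multiplies the gradient of the policy log-probability; this is the generic sense in which such a supervised objective yields an advantage-weighted, policy-gradient-style update.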