Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR

Abstract

Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have empowered large language models (LLMs) to tackle challenging reasoning tasks such as mathematics and programming. RLVR leverages verifiable outcome rewards to guide policy optimization, enabling LLMs to progressively improve output quality in a grounded and reliable manner. Despite its promise, the RLVR paradigm poses significant challenges: existing methods often suffer from sparse reward signals and unstable policy gradient updates, particularly in RL-based approaches. To address these challenges, we propose PACS, a novel RLVR framework that achieves imPlicit Actor Critic coupling via a Supervised learning framework. By treating the outcome reward as a predictable label, we reformulate the RLVR problem as a supervised learning task over a score function that is parameterized by the policy model and optimized with a cross-entropy loss. A detailed gradient analysis shows that this supervised formulation inherently recovers the classical policy gradient update while implicitly coupling the actor and critic roles, yielding more stable and efficient training. On challenging mathematical reasoning benchmarks, PACS outperforms strong RLVR baselines such as PPO and GRPO. For instance, PACS achieves 59.78% at pass@256 on AIME 2025, an improvement of 13.32 and 14.36 points over PPO and GRPO, respectively. This simple yet powerful framework offers a promising avenue for LLM post-training with verifiable rewards. Our code and data are available as open source at https://github.com/ritzz-ai/PACS.
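The supervised reformulation described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the score function PACS uses, so the example below assumes a plain scalar `score` standing in for whatever the policy model produces; the helper names `sigmoid`, `bce_loss`, and `bce_grad_wrt_score` are hypothetical and are not taken from the PACS repository. It only shows the generic fact that, for a binary reward label and a cross-entropy loss, the gradient with respect to the score is `sigmoid(score) - reward`, so the update is weighted by the gap between the observed reward and the model's own prediction of it.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softplus(x: float) -> float:
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bce_loss(score: float, reward: int) -> float:
    """Binary cross-entropy between a verifiable reward label (0 or 1)
    and the predicted probability sigmoid(score).

    Uses log(sigmoid(s)) = -softplus(-s) and
    log(1 - sigmoid(s)) = -softplus(s).
    """
    return reward * softplus(-score) + (1 - reward) * softplus(score)


def bce_grad_wrt_score(score: float, reward: int) -> float:
    """Analytic derivative of bce_loss with respect to score."""
    return sigmoid(score) - reward


if __name__ == "__main__":
    # Assumption: `score` is a hypothetical scalar output for one response.
    score, lr = 0.0, 1.0
    for reward in (1, 1, 0):
        grad = bce_grad_wrt_score(score, reward)
        # Descent step: the score moves by lr * (reward - sigmoid(score)),
        # i.e. the reward minus the model's own prediction of the reward.
        score -= lr * grad
        print(f"reward={reward} grad={grad:+.4f} new_score={score:+.4f}")
```

In this toy setting the term `reward - sigmoid(score)` behaves like a reward with a learned baseline subtracted, which is one way to read the abstract's claim that a single model plays both actor and critic roles. How the actual PACS score function is built from the policy's log-probabilities is an open detail here and should be taken from the paper or the linked repository.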
