Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR

Abstract

Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have empowered large language models (LLMs) to tackle challenging reasoning tasks such as mathematics and programming. RLVR leverages verifiable outcome rewards to guide policy optimization, enabling LLMs to progressively improve output quality in a grounded and reliable manner. Despite its promise, the RLVR paradigm poses significant challenges: existing methods, particularly RL-based approaches, often suffer from sparse reward signals and unstable policy gradient updates. To address these challenges, we propose PACS, a novel RLVR framework that achieves imPlicit Actor Critic coupling via a Supervised learning framework. By treating the outcome reward as a predictable label, we reformulate the RLVR problem as a supervised learning task over a score function that is parameterized by the policy model and optimized with a cross-entropy loss. A detailed gradient analysis shows that this supervised formulation inherently recovers the classical policy gradient update while implicitly coupling the actor and critic roles, yielding more stable and efficient training. On challenging mathematical reasoning benchmarks, PACS outperforms strong RLVR baselines such as PPO and GRPO. For instance, PACS achieves 59.78% at pass@256 on AIME 2025, an improvement of 13.32 and 14.36 points over PPO and GRPO, respectively. This simple yet powerful framework offers a promising avenue for post-training LLMs with verifiable rewards. Our code and data are available as open source at https://github.com/ritzz-ai/PACS.
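
The supervised reformulation described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the score function, so the one used here (a scaled log-probability ratio between the policy and a frozen reference model, with a hypothetical scale `beta`) is an assumption made purely for illustration, as are all function names. The sketch treats a binary verifiable reward as the label, applies binary cross-entropy to the score, and exposes the gradient with respect to the score.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def score(logp_policy: float, logp_ref: float, beta: float = 1.0) -> float:
    """Hypothetical score function (an assumption, not taken from the abstract):
    a scaled log-probability ratio between the policy and a frozen reference."""
    return beta * (logp_policy - logp_ref)


def bce_loss(s: float, reward: int) -> float:
    """Binary cross-entropy with the verifiable reward (0 or 1) as the label
    and sigmoid(s) as the predicted probability that the response is correct."""
    p = sigmoid(s)
    eps = 1e-12  # guards against log(0)
    return -(reward * math.log(p + eps) + (1 - reward) * math.log(1.0 - p + eps))


def bce_grad_wrt_score(s: float, reward: int) -> float:
    """Derivative of the loss with respect to the score: sigmoid(s) - reward."""
    return sigmoid(s) - reward


if __name__ == "__main__":
    # Illustrative log-probabilities of one sampled response; made-up values.
    s = score(logp_policy=-12.0, logp_ref=-12.5, beta=1.0)
    for reward in (0, 1):
        print(reward, bce_loss(s, reward), bce_grad_wrt_score(s, reward))
```

Under this assumed score, the chain rule gives a parameter gradient of (sigmoid(s) - reward) * beta * grad log pi_theta, so a descent step moves along grad log pi_theta weighted by (reward - sigmoid(s)). This has the shape of a policy gradient term in which the model's own reward prediction acts like a baseline, which is one way to read the implicit actor-critic coupling the abstract describes. The paper's exact formulation may differ.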
