Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

Abstract

Reinforcement learning with verifiable rewards (RLVR) for Large Reasoning Models hinges on baseline estimation for variance reduction, but existing approaches pay a heavy price: PPO requires a critic at the scale of the policy model, while GRPO needs multiple rollouts per prompt to keep its empirical group mean stable. We introduce Policy Optimization with Internal State Value Estimation (POISE), which obtains a baseline at negligible cost by using the policy model's internal signals already computed during the policy forward pass. A lightweight probe predicts the expected verifiable reward from the hidden states of the prompt and the generated trajectory, together with token-entropy statistics, and is trained online alongside the policy. To keep the gradient unbiased despite using trajectory-conditioned features, we introduce a cross-rollout construction that predicts each rollout's value from an independent rollout's internal states. Because POISE estimates prompt value from a single rollout, it allows higher prompt diversity under a fixed training compute budget. This reduces gradient variance, giving more stable learning, and also removes the sampling overhead of detecting zero-advantage prompts. On Qwen3-4B and DeepSeek-R1-Distill-Qwen-1.5B across math reasoning benchmarks, POISE matches DAPO while requiring less compute. Moreover, its value estimator performs similarly to a separate LLM-scale value model and generalizes to various verifiable tasks. By leveraging the model's own internal representations, POISE enables more stable and efficient policy optimization.
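
The cross-rollout construction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes exactly two independent rollouts per prompt, a flat feature vector standing in for the hidden-state and token-entropy features, a linear probe with a sigmoid head, and a squared-error probe update. The names `probe_value`, `cross_rollout_advantages`, and `probe_sgd_step` are hypothetical.

```python
import math

def probe_value(features, weights, bias):
    """Hypothetical linear probe with a sigmoid head: predicted reward in (0, 1)."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def cross_rollout_advantages(feats_a, feats_b, reward_a, reward_b, weights, bias):
    """Advantages for two independent rollouts (a, b) of the same prompt.

    Each rollout's baseline is predicted from the OTHER rollout's features,
    so the baseline never depends on the rollout it is subtracted from.
    """
    baseline_for_a = probe_value(feats_b, weights, bias)
    baseline_for_b = probe_value(feats_a, weights, bias)
    return reward_a - baseline_for_a, reward_b - baseline_for_b

def probe_sgd_step(feats, target_reward, weights, bias, lr=0.5):
    """One squared-error gradient step (an assumed loss) toward an observed reward."""
    pred = probe_value(feats, weights, bias)
    # d/dz of (pred - target)^2, using d(pred)/dz = pred * (1 - pred)
    grad_z = 2.0 * (pred - target_reward) * pred * (1.0 - pred)
    new_weights = [w - lr * grad_z * f for w, f in zip(weights, feats)]
    return new_weights, bias - lr * grad_z

# Toy usage: rollout a was verified correct (reward 1), rollout b was not (reward 0).
adv_a, adv_b = cross_rollout_advantages(
    feats_a=[0.0], feats_b=[math.log(3.0)],
    reward_a=1.0, reward_b=0.0,
    weights=[1.0], bias=0.0,
)
```

Because the baseline subtracted from a rollout is computed only from an independently sampled rollout, it is independent of that rollout given the prompt. That is the standard condition under which a baseline leaves the policy-gradient estimate unbiased.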
