

Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

May 8, 2026
作者: Yunho Choi, Jongwon Lim, Woojin Ahn, Minjae Oh, Jeonghoon Shim, Yohan Jo
cs.AI

Abstract

Reinforcement learning with verifiable rewards (RLVR) for Large Reasoning Models hinges on baseline estimation for variance reduction, but existing approaches pay a heavy price: PPO requires a policy-model-scale critic, while GRPO needs multiple rollouts per prompt to keep its empirical group mean stable. We introduce Policy Optimization with Internal State Value Estimation (POISE), which obtains a baseline at negligible cost by reusing the policy model's internal signals already computed during the policy forward pass. A lightweight probe predicts the expected verifiable reward from the hidden states of the prompt and generated trajectory, together with token-entropy statistics, and is trained online alongside the policy. To preserve gradient unbiasedness despite using trajectory-conditioned features, we introduce a cross-rollout construction that predicts each rollout's value from an independent rollout's internal states. Because POISE estimates prompt value using only a single rollout, it enables higher prompt diversity for a fixed compute budget during training. This reduces gradient variance for more stable learning and also eliminates the sampling overhead of detecting zero-advantage prompts. On Qwen3-4B and DeepSeek-R1-Distill-Qwen-1.5B across math reasoning benchmarks, POISE matches DAPO while requiring less compute. Moreover, its value estimator performs similarly to a separate LLM-scale value model and generalizes to various verifiable tasks. By leveraging the model's own internal representations, POISE enables more stable and efficient policy optimization.
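The abstract describes two mechanisms: a lightweight probe that maps internal signals to an expected-reward baseline, and a cross-rollout pairing that keeps the baseline independent of the rollout it is applied to. The following is a minimal sketch of those ideas, not the authors' implementation; the MLP architecture, mean-pooled hidden states, mean/std entropy features, and the roll-based pairing of rollouts are all illustrative assumptions rather than details taken from the paper.

```python
# Hypothetical sketch of a POISE-style value probe and cross-rollout baseline.
# All module names, feature choices, and shapes below are assumptions for illustration.
import torch
import torch.nn as nn


class ValueProbe(nn.Module):
    """Lightweight head mapping the policy's internal signals to a scalar expected reward."""

    def __init__(self, hidden_size: int, entropy_feats: int = 2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_size + entropy_feats, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, prompt_hidden, rollout_hidden, token_entropy):
        # prompt_hidden:  (batch, prompt_len, hidden) hidden states of the prompt tokens
        # rollout_hidden: (batch, gen_len, hidden)    hidden states of the generated trajectory
        # token_entropy:  (batch, gen_len)            per-token entropy of the policy distribution
        prompt_feat = prompt_hidden.mean(dim=1)      # mean-pool prompt representation (assumed)
        rollout_feat = rollout_hidden.mean(dim=1)    # mean-pool trajectory representation (assumed)
        ent_feat = torch.stack(
            [token_entropy.mean(dim=1), token_entropy.std(dim=1)], dim=-1
        )                                            # simple entropy statistics (assumed)
        x = torch.cat([prompt_feat, rollout_feat, ent_feat], dim=-1)
        return self.mlp(x).squeeze(-1)               # predicted expected verifiable reward


def cross_rollout_advantages(values_per_rollout, rewards):
    """Advantage for rollout i using an *independent* rollout's value prediction as baseline.

    values_per_rollout: (num_rollouts,) probe predictions for rollouts of the same prompt
    rewards:            (num_rollouts,) verifiable rewards for those rollouts
    Pairing each rollout with a different rollout's prediction keeps the baseline
    independent of that rollout's own trajectory, which is the point of the
    cross-rollout construction. The cyclic pairing via torch.roll is an assumed scheme.
    """
    baselines = torch.roll(values_per_rollout, shifts=1, dims=0)
    return rewards - baselines
```

In this sketch the probe is trained online, e.g. with a regression loss between its prediction and the observed verifiable reward, while the resulting advantages feed the policy-gradient update; the exact losses and pairing used by POISE are not specified in the abstract.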