

Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

May 8, 2026
作者: Yunho Choi, Jongwon Lim, Woojin Ahn, Minjae Oh, Jeonghoon Shim, Yohan Jo
cs.AI

Abstract

Reinforcement learning with verifiable rewards (RLVR) for Large Reasoning Models hinges on baseline estimation for variance reduction, but existing approaches pay a heavy price: PPO requires a policy-model-scale critic, while GRPO needs multiple rollouts per prompt to keep its empirical group mean stable. We introduce Policy Optimization with Internal State Value Estimation (POISE), which obtains a baseline at negligible cost by using the policy model's internal signals already computed during the policy forward pass. A lightweight probe predicts the expected verifiable reward from the hidden states of the prompt and generated trajectory, as well as token-entropy statistics, and is trained online alongside the policy. To preserve gradient unbiasedness despite using trajectory-conditioned features, we introduce a cross-rollout construction that predicts each rollout's value from an independent rollout's internal states. Because POISE estimates prompt value using only a single rollout, it enables higher prompt diversity for a fixed compute budget during training. This reduces gradient variance for more stable learning and also eliminates the compute overhead of sampling to detect zero-advantage prompts. On Qwen3-4B and DeepSeek-R1-Distill-Qwen-1.5B across math reasoning benchmarks, POISE matches DAPO while requiring less compute. Moreover, its value estimator performs comparably to a separate LLM-scale value model and generalizes to various verifiable tasks. By leveraging the model's own internal representations, POISE enables more stable and efficient policy optimization.
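Below is a minimal sketch of the two mechanisms the abstract describes, a lightweight value probe over the policy's internal states and a cross-rollout baseline; it is not the paper's implementation. The names (`ValueProbe`, `cross_rollout_baseline`), the mean-pooling of hidden states, the choice of entropy summary statistics, and the use of two rollouts per prompt for pairing are all illustrative assumptions.

```python
# Illustrative sketch only; all module and function names are assumptions,
# not the POISE reference implementation.
import torch
import torch.nn as nn


class ValueProbe(nn.Module):
    """Lightweight probe predicting the expected verifiable reward from pooled
    hidden states of the prompt + trajectory and token-entropy statistics."""

    def __init__(self, hidden_dim: int, n_entropy_feats: int = 3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim + n_entropy_feats, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, hidden_states: torch.Tensor, token_entropy: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim), reused from the policy's own forward pass
        # token_entropy: (batch, seq_len), per-token entropy of the policy distribution
        pooled = hidden_states.mean(dim=1)  # mean-pool over tokens (assumed pooling)
        ent_feats = torch.stack(            # simple entropy summary statistics (assumed choice)
            [token_entropy.mean(-1), token_entropy.std(-1), token_entropy.max(-1).values],
            dim=-1,
        )
        return self.mlp(torch.cat([pooled, ent_feats], dim=-1)).squeeze(-1)


def cross_rollout_baseline(values: torch.Tensor) -> torch.Tensor:
    """Cross-rollout construction: each rollout's baseline is the probe's prediction
    on a sibling rollout of the same prompt, so the baseline is independent of the
    trajectory being scored and the policy gradient stays unbiased.

    values: (n_prompts, n_rollouts) probe predictions, n_rollouts >= 2 assumed here.
    """
    return torch.roll(values, shifts=1, dims=1)


if __name__ == "__main__":
    # Toy usage with placeholder tensors: two rollouts per prompt,
    # advantage = verified reward - cross-rollout baseline.
    probe = ValueProbe(hidden_dim=64)
    h = torch.randn(4, 2, 32, 64)                  # (prompts, rollouts, tokens, hidden_dim)
    ent = torch.rand(4, 2, 32)                     # per-token entropies
    rewards = torch.randint(0, 2, (4, 2)).float()  # verifiable 0/1 rewards

    values = probe(h.flatten(0, 1), ent.flatten(0, 1)).view(4, 2)
    advantages = rewards - cross_rollout_baseline(values).detach()

    # The probe is trained online against the verified rewards (MSE here as a stand-in).
    probe_loss = nn.functional.mse_loss(values, rewards)
    print(advantages.shape, probe_loss.item())
```

The key design point mirrored here is that the baseline fed into the advantage is computed from an independent rollout's features and detached, so it does not depend on the sampled trajectory it is subtracted from, which is what lets trajectory-conditioned internal signals be used without biasing the policy gradient.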