

Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

May 8, 2026
作者: Yunho Choi, Jongwon Lim, Woojin Ahn, Minjae Oh, Jeonghoon Shim, Yohan Jo
cs.AI

Abstract

Reinforcement learning with verifiable rewards (RLVR) for Large Reasoning Models hinges on baseline estimation for variance reduction, but existing approaches pay a heavy price: PPO requires a policy-model-scale critic, while GRPO needs multiple rollouts per prompt to keep its empirical group mean stable. We introduce Policy Optimization with Internal State Value Estimation (POISE), which obtains a baseline at negligible cost by using the policy model's internal signals already computed during the policy forward pass. A lightweight probe predicts the expected verifiable reward from the hidden states of the prompt and generated trajectory, as well as token-entropy statistics, and is trained online alongside the policy. To preserve gradient unbiasedness despite using trajectory-conditioned features, we introduce a cross-rollout construction that predicts each rollout's value from an independent rollout's internal states. Because POISE estimates prompt value using only a single rollout, it enables higher prompt diversity for a fixed compute budget during training. This reduces gradient variance for more stable learning and also eliminates the compute overhead of sampling to detect zero-advantage prompts. On Qwen3-4B and DeepSeek-R1-Distill-Qwen-1.5B across math reasoning benchmarks, POISE matches DAPO while requiring less compute. Moreover, its value estimator performs comparably to a separate LLM-scale value model and generalizes to various verifiable tasks. By leveraging the model's own internal representations, POISE enables more stable and efficient policy optimization.
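Below is a minimal sketch of the two mechanisms the abstract describes, a lightweight value probe over the policy's internal states and a cross-rollout baseline; it is not the paper's implementation. The names (`ValueProbe`, `cross_rollout_baseline`), the mean-pooling of hidden states, the choice of entropy summary statistics, and the use of two rollouts per prompt for pairing are all illustrative assumptions.

```python
# Illustrative sketch only; all module and function names are assumptions,
# not the POISE reference implementation.
import torch
import torch.nn as nn


class ValueProbe(nn.Module):
    """Lightweight probe predicting the expected verifiable reward from pooled
    hidden states of the prompt + trajectory and token-entropy statistics."""

    def __init__(self, hidden_dim: int, n_entropy_feats: int = 3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim + n_entropy_feats, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, hidden_states: torch.Tensor, token_entropy: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim), reused from the policy's own forward pass
        # token_entropy: (batch, seq_len), per-token entropy of the policy distribution
        pooled = hidden_states.mean(dim=1)  # mean-pool over tokens (assumed pooling)
        ent_feats = torch.stack(            # simple entropy summary statistics (assumed choice)
            [token_entropy.mean(-1), token_entropy.std(-1), token_entropy.max(-1).values],
            dim=-1,
        )
        return self.mlp(torch.cat([pooled, ent_feats], dim=-1)).squeeze(-1)


def cross_rollout_baseline(values: torch.Tensor) -> torch.Tensor:
    """Cross-rollout construction: each rollout's baseline is the probe's prediction
    on a sibling rollout of the same prompt, so the baseline is independent of the
    trajectory being scored and the policy gradient stays unbiased.

    values: (n_prompts, n_rollouts) probe predictions, n_rollouts >= 2 assumed here.
    """
    return torch.roll(values, shifts=1, dims=1)


if __name__ == "__main__":
    # Toy usage with placeholder tensors: two rollouts per prompt,
    # advantage = verified reward - cross-rollout baseline.
    probe = ValueProbe(hidden_dim=64)
    h = torch.randn(4, 2, 32, 64)                  # (prompts, rollouts, tokens, hidden_dim)
    ent = torch.rand(4, 2, 32)                     # per-token entropies
    rewards = torch.randint(0, 2, (4, 2)).float()  # verifiable 0/1 rewards

    values = probe(h.flatten(0, 1), ent.flatten(0, 1)).view(4, 2)
    advantages = rewards - cross_rollout_baseline(values).detach()

    # The probe is trained online against the verified rewards (MSE here as a stand-in).
    probe_loss = nn.functional.mse_loss(values, rewards)
    print(advantages.shape, probe_loss.item())
```

The key design point mirrored here is that the baseline fed into the advantage is computed from an independent rollout's features and detached, so it does not depend on the sampled trajectory it is subtracted from, which is what lets trajectory-conditioned internal signals be used without biasing the policy gradient.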