PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning
August 28, 2025
Authors: Wenfeng Feng, Penghong Zhao, Guochao Jiang, Chuzhan Hao, Yuewei Zhang, Hao Wang
cs.AI
Abstract
Critic-free reinforcement learning methods, particularly group policies, have
attracted considerable attention for their efficiency on complex tasks.
However, these methods rely heavily on multiple samples and comparisons within
the policy to estimate the advantage, which may cause the policy to fall into a
local optimum and increases the computational cost. To address these issues, we
propose PVPO, an efficient reinforcement learning method enhanced by an
advantage reference anchor and data pre-sampling. Specifically, we use a
reference model to perform rollouts in advance and employ the calculated reward
scores as a reference anchor. Our approach effectively corrects the cumulative
bias introduced by intra-group comparisons and significantly reduces reliance
on a large number of rollouts. Meanwhile, the reference model can assess sample
difficulty during data pre-sampling, enabling effective selection of high-gain
data and thereby improving training efficiency. Experiments on nine datasets
across two domains demonstrate that PVPO achieves state-of-the-art (SOTA)
performance. Our approach not only shows robust generalization across multiple
tasks, but also scales well across models of varying sizes.
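A minimal sketch of the two ideas described in the abstract, assuming a GRPO-style group setting. The function names, the exact advantage formula, and the difficulty band used for data selection are illustrative assumptions made for exposition, not PVPO's reference implementation:

```python
# Illustrative sketch only: names, signatures, and formulas below are assumptions,
# not PVPO's reference implementation.
import numpy as np


def reference_anchor(ref_rewards: np.ndarray) -> float:
    """Pre-estimated value for a prompt: mean reward over offline reference-model rollouts."""
    return float(ref_rewards.mean())


def anchored_advantages(policy_rewards: np.ndarray, anchor: float, eps: float = 1e-6) -> np.ndarray:
    """Center the group's rewards on the pre-estimated anchor (instead of only the
    intra-group mean) and normalize by the group's spread (assumed formulation)."""
    return (policy_rewards - anchor) / (policy_rewards.std() + eps)


def select_high_gain(prompts, ref_success_rates, low=0.1, high=0.9):
    """Data pre-sampling: keep prompts the reference model neither always solves
    nor always fails (assumed 'high-gain' difficulty band)."""
    return [p for p, r in zip(prompts, ref_success_rates) if low <= r <= high]


if __name__ == "__main__":
    ref_rewards = np.array([1.0, 0.0, 1.0, 0.0])     # offline reference rollouts, scored once
    policy_rewards = np.array([1.0, 1.0, 0.0, 1.0])  # online rollouts of the current policy
    adv = anchored_advantages(policy_rewards, reference_anchor(ref_rewards))
    print("advantages:", adv)

    prompts = ["q1", "q2", "q3"]
    print("kept:", select_high_gain(prompts, ref_success_rates=[1.0, 0.5, 0.0]))
```

Because the anchor is computed once from reference-model rollouts, it does not drift with the current group's samples, which is how the sketch reflects the claimed reduction in intra-group bias and in the number of rollouts needed per prompt.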