PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning
August 28, 2025
Authors: Wenfeng Feng, Penghong Zhao, Guochao Jiang, Chuzhan Hao, Yuewei Zhang, Hao Wang
cs.AI
Abstract
Critic-free reinforcement learning methods, particularly group policies, have
attracted considerable attention for their efficiency in complex tasks.
However, these methods rely heavily on repeated sampling and comparison within
the policy to estimate the advantage, which can trap the policy in a local
optimum and increases the computational cost. To address these issues, we propose
PVPO, an efficient reinforcement learning method enhanced by an advantage
reference anchor and data pre-sampling. Specifically, we use the reference
model to perform rollouts in advance and employ the calculated reward score as a
reference anchor. Our approach effectively corrects the cumulative bias
introduced by intra-group comparisons and significantly reduces reliance on the
number of rollouts. Meanwhile, the reference model can assess sample difficulty
during data pre-sampling, enabling effective selection of high-gain data to
improve training efficiency. Experiments conducted on nine datasets across two
domains demonstrate that PVPO achieves state-of-the-art (SOTA) performance. Our
approach not only shows robust generalization across multiple tasks, but
also exhibits scalable performance across models of varying scales.
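
To make the description above concrete, here is a minimal Python sketch of how an advantage reference anchor and difficulty-based data pre-sampling could fit together. All names (`reference_anchor`, `keep_for_training`, `anchored_advantages`), the 0/1 reward convention, the filtering thresholds, and the standard-deviation normalization are assumptions for illustration, not the paper's actual implementation.

```python
import statistics
from typing import Sequence

# A minimal sketch of the mechanism described in the abstract, under stated
# assumptions: names, thresholds, and normalization below are illustrative,
# not the paper's actual implementation.

def reference_anchor(ref_rewards: Sequence[float]) -> float:
    """Pre-estimated value for a prompt: mean reward of reference-model
    rollouts collected before policy training."""
    return sum(ref_rewards) / len(ref_rewards)

def keep_for_training(anchor: float, low: float = 0.1, high: float = 0.9) -> bool:
    """Data pre-sampling: skip prompts the reference model already solves
    almost always (too easy) or almost never (too hard)."""
    return low <= anchor <= high

def anchored_advantages(policy_rewards: Sequence[float], anchor: float) -> list[float]:
    """Advantage of each policy rollout measured against the pre-computed
    anchor rather than purely against the intra-group mean; the std
    normalization is an assumption borrowed from group-policy practice."""
    scale = statistics.pstdev(policy_rewards) or 1.0
    return [(r - anchor) / scale for r in policy_rewards]

# Example: four reference rollouts define the anchor; if the prompt is kept,
# four policy rollouts are scored against that anchor instead of their own mean.
ref_rewards = [1.0, 0.0, 1.0, 1.0]        # e.g. exact-match rewards
anchor = reference_anchor(ref_rewards)    # 0.75
if keep_for_training(anchor):
    advantages = anchored_advantages([1.0, 1.0, 0.0, 1.0], anchor)
```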