PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning
August 28, 2025
Authors: Wenfeng Feng, Penghong Zhao, Guochao Jiang, Chuzhan Hao, Yuewei Zhang, Hao Wang
cs.AI
Abstract
Critic-free reinforcement learning methods, particularly group policies, have
attracted considerable attention for their efficiency in complex tasks.
However, these methods rely heavily on repeated sampling and comparison within
the policy to estimate the advantage, which can trap the policy in a local
optimum and increases the computational cost. To address these issues, we propose
PVPO, an efficient reinforcement learning method enhanced by an advantage
reference anchor and data pre-sampling. Specifically, we use the reference
model to perform rollouts in advance and employ the calculated reward score as a
reference anchor. Our approach effectively corrects the cumulative bias
introduced by intra-group comparisons and significantly reduces reliance on the
number of rollouts. Meanwhile, the reference model can assess sample difficulty
during data pre-sampling, enabling effective selection of high-gain data to
improve training efficiency. Experiments conducted on nine datasets across two
domains demonstrate that PVPO achieves state-of-the-art (SOTA) performance. Our
approach not only shows robust generalization across multiple tasks, but
also exhibits scalable performance across models of varying scales.
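
To make the description above concrete, here is a minimal Python sketch of how an advantage reference anchor and difficulty-based data pre-sampling could fit together. All names (`reference_anchor`, `keep_for_training`, `anchored_advantages`), the 0/1 reward convention, the filtering thresholds, and the standard-deviation normalization are assumptions for illustration, not the paper's actual implementation.

```python
import statistics
from typing import Sequence

# A minimal sketch of the mechanism described in the abstract, under stated
# assumptions: names, thresholds, and normalization below are illustrative,
# not the paper's actual implementation.

def reference_anchor(ref_rewards: Sequence[float]) -> float:
    """Pre-estimated value for a prompt: mean reward of reference-model
    rollouts collected before policy training."""
    return sum(ref_rewards) / len(ref_rewards)

def keep_for_training(anchor: float, low: float = 0.1, high: float = 0.9) -> bool:
    """Data pre-sampling: skip prompts the reference model already solves
    almost always (too easy) or almost never (too hard)."""
    return low <= anchor <= high

def anchored_advantages(policy_rewards: Sequence[float], anchor: float) -> list[float]:
    """Advantage of each policy rollout measured against the pre-computed
    anchor rather than purely against the intra-group mean; the std
    normalization is an assumption borrowed from group-policy practice."""
    scale = statistics.pstdev(policy_rewards) or 1.0
    return [(r - anchor) / scale for r in policy_rewards]

# Example: four reference rollouts define the anchor; if the prompt is kept,
# four policy rollouts are scored against that anchor instead of their own mean.
ref_rewards = [1.0, 0.0, 1.0, 1.0]        # e.g. exact-match rewards
anchor = reference_anchor(ref_rewards)    # 0.75
if keep_for_training(anchor):
    advantages = anchored_advantages([1.0, 1.0, 0.0, 1.0], anchor)
```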