PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning

Abstract

Critic-free reinforcement learning methods, particularly group policy methods, have attracted considerable attention for their efficiency in complex tasks. However, these methods rely heavily on repeated sampling and comparison within the policy to estimate the advantage, which can trap the policy in a local optimum and increases computational cost. To address these issues, we propose PVPO, an efficient reinforcement learning method enhanced by an advantage reference anchor and data pre-sampling. Specifically, we roll out the reference model in advance and use the resulting reward score as a reference anchor. Our approach effectively corrects the cumulative bias introduced by intra-group comparisons and significantly reduces reliance on the number of rollouts. In addition, the reference model assesses sample difficulty during data pre-sampling, which enables effective selection of high-gain data and improves training efficiency. Experiments on nine datasets across two domains demonstrate that PVPO achieves state-of-the-art (SOTA) performance. Our approach not only generalizes robustly across multiple tasks but also scales across models of varying sizes.
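
As a rough illustration of the reference-anchor idea, the following Python sketch contrasts a group-relative baseline with a baseline taken from reference-model rewards computed in advance. The abstract does not state the exact advantage formula or the data selection rule, so the mean-reward anchor, the function names, and the difficulty thresholds are all assumptions made for illustration, not the method as published.

```python
from statistics import mean


def group_relative_advantages(policy_rewards):
    """Group-relative baseline: compare each rollout with the mean of its own group."""
    baseline = mean(policy_rewards)
    return [r - baseline for r in policy_rewards]


def anchored_advantages(policy_rewards, reference_rewards):
    """Anchored baseline (assumed form): compare each policy rollout with the
    mean reward of reference-model rollouts that were computed in advance."""
    anchor = mean(reference_rewards)
    return [r - anchor for r in policy_rewards]


def keep_for_training(reference_rewards, low=0.0, high=1.0):
    """Hypothetical difficulty filter: keep a sample only if the reference
    model's mean reward lies strictly between `low` and `high`."""
    return low < mean(reference_rewards) < high


# One prompt where every policy rollout succeeds but the reference model rarely does.
policy_rewards = [1.0, 1.0, 1.0, 1.0]
reference_rewards = [0.0, 1.0, 0.0, 0.0]

group_adv = group_relative_advantages(policy_rewards)
anchor_adv = anchored_advantages(policy_rewards, reference_rewards)
```

In this toy case the group-relative advantages are all zero because the rollouts are identical, while the anchored advantages stay positive because the anchor does not depend on the current group. This is one way to read the claim that an external anchor reduces reliance on the number of rollouts.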
