PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning

Abstract

Critic-free reinforcement learning methods, particularly group policies, have attracted considerable attention for their efficiency in complex tasks. However, these methods estimate advantage by sampling the policy many times and comparing the samples with one another, which can trap the policy in a local optimum and increases computational cost. To address these issues, we propose PVPO, an efficient reinforcement learning method enhanced by an advantage reference anchor and data pre-sampling. Specifically, we roll out the reference model in advance and use the resulting reward score as a reference anchor. Our approach effectively corrects the cumulative bias introduced by intra-group comparisons and significantly reduces reliance on the number of rollouts. In addition, the reference model can assess sample difficulty during data pre-sampling, which allows high-gain data to be selected and improves training efficiency. Experiments on nine datasets across two domains show that PVPO achieves state-of-the-art (SOTA) performance. Our approach not only generalizes robustly across multiple tasks, but also scales across models of varying sizes.
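
To make the contrast between intra-group comparison and a reference anchor concrete, the following is a minimal sketch of one possible reading of the abstract, not the authors' implementation. The abstract gives no formulas, so everything here is an assumption: the function names, the use of a plain mean as the anchor, the `reference_rewards` field, and the `low`/`high` thresholds in the pre-sampling filter are all hypothetical.

```python
from statistics import mean


def group_mean_advantages(rewards):
    """Group-relative baseline: each rollout is scored against the mean
    reward of its own group of policy rollouts."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def anchored_advantages(rewards, reference_rewards):
    """Anchor baseline (assumed form): each policy rollout is scored against
    the mean reward of rollouts produced in advance by the reference model."""
    anchor = mean(reference_rewards)
    return [r - anchor for r in rewards]


def select_by_difficulty(samples, low=0.0, high=1.0):
    """Pre-sampling filter (hypothetical thresholds): keep samples whose mean
    reference reward m satisfies low <= m < high, so that samples the
    reference model already solves every time are dropped."""
    return [s for s in samples if low <= mean(s["reference_rewards"]) < high]


# A group in which every policy rollout succeeds.
policy_rewards = [1.0, 1.0, 1.0, 1.0]
# Pre-computed reference rollouts for the same prompt: one success in four.
reference_rewards = [0.0, 0.0, 1.0, 0.0]

group_adv = group_mean_advantages(policy_rewards)
anchor_adv = anchored_advantages(policy_rewards, reference_rewards)

samples = [
    {"id": "easy", "reference_rewards": [1.0, 1.0, 1.0, 1.0]},
    {"id": "medium", "reference_rewards": [0.0, 1.0, 0.0, 1.0]},
]
kept_ids = [s["id"] for s in select_by_difficulty(samples)]
```

In this toy case the group-mean baseline assigns zero advantage to every rollout, because all rewards in the group are identical, while the anchored version still yields a positive signal relative to the reference model. The filter drops the sample that the reference model solves on every attempt, which is one simple way to interpret selecting high-gain data.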
