VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

Abstract

We present VAPO (Value-based Augmented Proximal Policy Optimization), a novel framework tailored for reasoning models within the value-based paradigm. Benchmarked on the AIME 2024 dataset, VAPO, built on the Qwen 32B pre-trained model, attains a state-of-the-art score of 60.4. In direct comparison under identical experimental settings, VAPO outperforms the previously reported results of DeepSeek-R1-Zero-Qwen-32B and DAPO by more than 10 points. The training process of VAPO stands out for its stability and efficiency: it reaches state-of-the-art performance within 5,000 steps, and no training crashes occur across multiple independent runs, underscoring its reliability. This research delves into long chain-of-thought (long-CoT) reasoning using a value-based reinforcement learning framework. We pinpoint three key challenges that plague value-based methods: value model bias, the presence of heterogeneous sequence lengths, and the sparsity of reward signals. Through systematic design, VAPO offers an integrated solution that effectively alleviates these challenges, enabling enhanced performance in long-CoT reasoning tasks.
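The interaction between sparse rewards, sequence length, and an uninformative value model can be made concrete with a small sketch. The code below uses standard generalized advantage estimation (GAE), not VAPO's own procedure, which the abstract does not specify. The function names, the all-zero value estimates (standing in for a value model that has learned nothing yet), the single terminal reward of 1.0, and the chosen lambda values are all illustrative assumptions.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Standard generalized advantage estimation for one episode.

    rewards[t] and values[t] are per-token; the value after the
    final token is taken to be 0 (episode ends there).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the discounted sum of later TD errors.
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def first_token_advantage(length, lam):
    """Advantage seen by the first token of a response of the given length.

    Assumption: reward is sparse (1.0 on the last token only) and the
    value model outputs 0 everywhere.
    """
    rewards = [0.0] * (length - 1) + [1.0]
    values = [0.0] * length
    return gae(rewards, values, lam=lam)[0]


if __name__ == "__main__":
    for length in (10, 100, 1000):
        for lam in (0.95, 1.0):
            adv = first_token_advantage(length, lam)
            print(f"length={length:5d} lambda={lam:.2f} first-token advantage={adv:.3e}")
```

Under these assumptions, the first token's advantage equals lambda raised to the power (length - 1). With lambda below 1, the terminal reward therefore barely reaches the early tokens of a long response, while short responses still receive a usable signal. With lambda equal to 1 the signal arrives undiminished regardless of length. This is one way to see why a single fixed setting can behave very differently across heterogeneous sequence lengths.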
