Truncated Proximal Policy Optimization

Abstract

Recently, Large Language Models (LLMs) with test-time scaling have demonstrated exceptional reasoning capabilities across scientific and professional tasks by generating long chains of thought (CoT). As a crucial component in developing these reasoning models, reinforcement learning (RL), exemplified by Proximal Policy Optimization (PPO) and its variants, allows models to learn through trial and error. However, PPO can be time-consuming because it is inherently on-policy, and the cost grows further as response lengths increase. In this work, we propose Truncated Proximal Policy Optimization (T-PPO), a novel extension to PPO that improves training efficiency by streamlining policy updates and length-restricted response generation. T-PPO mitigates low hardware utilization, an inherent drawback of fully synchronized long-generation procedures, in which resources often sit idle while waiting for complete rollouts. Our contributions are twofold. First, we propose Extended Generalized Advantage Estimation (EGAE), which estimates advantages from incomplete responses while maintaining the integrity of policy learning. Second, we devise a computationally optimized mechanism that allows the policy and value models to be optimized independently. By selectively filtering prompts and truncated tokens, this mechanism reduces redundant computation and accelerates training without sacrificing convergence performance. We demonstrate the effectiveness and efficiency of T-PPO on AIME 2024 with a 32B base model. The experimental results show that T-PPO improves the training efficiency of reasoning LLMs by up to 2.5x and outperforms its existing competitors.
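The abstract does not define EGAE, so the following is only an illustrative sketch of one way advantages could be estimated from an incomplete response: standard Generalized Advantage Estimation (GAE), with the value estimate at the truncation point used as a bootstrap target. The function name `truncated_gae`, its parameters, and the bootstrapping choice are assumptions made for illustration and should not be read as the authors' formulation.

```python
from typing import List


def truncated_gae(
    rewards: List[float],
    values: List[float],
    bootstrap_value: float,
    gamma: float = 1.0,
    lam: float = 0.95,
) -> List[float]:
    """Standard GAE over a (possibly truncated) token trajectory.

    Hypothetical helper, not the paper's EGAE definition.

    rewards[t]       reward received after generating token t
    values[t]        value estimate V(s_t) for the state before token t
    bootstrap_value  V(s_T) at the cut-off point; pass 0.0 when the
                     response actually finished (terminal state)
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")

    advantages = [0.0] * len(rewards)
    next_value = bootstrap_value  # value of the state after the last token
    running = 0.0                 # discounted sum of later TD residuals

    # Walk backwards so each step reuses the accumulated sum.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]

    return advantages


if __name__ == "__main__":
    # Truncated response: no reward observed yet, bootstrap from V(s_T).
    print(truncated_gae([0.0, 0.0, 0.0], [0.5, 0.5, 0.5], bootstrap_value=0.5))
    # Finished response: terminal reward of 1.0, no bootstrap.
    print(truncated_gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], bootstrap_value=0.0))
```

In this sketch, a truncated response contributes advantages that depend only on tokens generated so far plus the value model's estimate at the cut-off, so a policy update would not need to wait for the full rollout. Whether T-PPO handles the truncation point in exactly this way is not stated in the abstract.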