Truncated Proximal Policy Optimization

Abstract

Recently, large language models (LLMs) with test-time scaling have demonstrated exceptional reasoning capabilities across scientific and professional tasks by generating long chains of thought (CoT). Reinforcement learning (RL), exemplified by Proximal Policy Optimization (PPO) and its variants, is a crucial component in developing these reasoning models because it allows them to learn through trial and error. However, PPO can be time-consuming due to its inherent on-policy nature, and this cost is further exacerbated by increasing response lengths. In this work, we propose Truncated Proximal Policy Optimization (T-PPO), a novel extension to PPO that improves training efficiency by streamlining policy updates and length-restricted response generation. T-PPO mitigates the low hardware utilization inherent in fully synchronized long-generation procedures, where resources often sit idle while waiting for complete rollouts. Our contributions are twofold. First, we propose Extended Generalized Advantage Estimation (EGAE), which estimates advantages from incomplete responses while maintaining the integrity of policy learning. Second, we devise a computationally optimized mechanism that allows the policy and value models to be optimized independently. By selectively filtering prompt and truncated tokens, this mechanism reduces redundant computation and accelerates training without sacrificing convergence performance. We demonstrate the effectiveness and efficiency of T-PPO on AIME 2024 with a 32B base model. The experimental results show that T-PPO improves the training efficiency of reasoning LLMs by up to 2.5x and outperforms its existing competitors.
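To make the idea of estimating advantages from an incomplete response concrete, the sketch below shows the standard truncated form of Generalized Advantage Estimation, in which a response cut off at a length limit is bootstrapped with the value estimate at the truncation point. This is the textbook GAE recursion, not the paper's EGAE definition, which the abstract does not spell out. The function name `gae_with_bootstrap`, its parameters, and the default `gamma` and `lam` values are assumptions made for illustration.

```python
from typing import List


def gae_with_bootstrap(
    rewards: List[float],
    values: List[float],
    bootstrap_value: float,
    gamma: float = 1.0,
    lam: float = 0.95,
) -> List[float]:
    """Standard truncated GAE (illustrative; not the paper's EGAE).

    rewards[t]      : reward received after generating token t
    values[t]       : value estimate V(s_t) before generating token t
    bootstrap_value : V(s_T) at the cut-off point; use 0.0 when the
                      response actually terminated (assumed convention)
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")

    advantages = [0.0] * len(rewards)
    next_value = bootstrap_value  # value of the state after the last token
    running = 0.0                 # discounted sum of later TD residuals

    # Walk backwards so each step reuses the accumulated future residuals.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

For a completed response, pass `bootstrap_value=0.0`. For a response truncated at the length limit, pass the value model's estimate at the truncation point, so that the tokens generated so far can still receive an advantage estimate before the rollout finishes.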