

Truncated Proximal Policy Optimization

June 18, 2025
Authors: Tiantian Fan, Lingjun Liu, Yu Yue, Jiaze Chen, Chengyi Wang, Qiying Yu, Chi Zhang, Zhiqi Lin, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Bole Ma, Mofan Zhang, Gaohong Liu, Ru Zhang, Haotian Zhou, Cong Xie, Ruidong Zhu, Zhi Zhang, Xin Liu, Mingxuan Wang, Lin Yan, Yonghui Wu
cs.AI

Abstract

Recently, test-time scaling Large Language Models (LLMs) have demonstrated exceptional reasoning capabilities across scientific and professional tasks by generating long chains-of-thought (CoT). As a crucial component for developing these reasoning models, reinforcement learning (RL), exemplified by Proximal Policy Optimization (PPO) and its variants, allows models to learn through trial and error. However, PPO can be time-consuming due to its inherent on-policy nature, a problem that is further exacerbated by increasing response lengths. In this work, we propose Truncated Proximal Policy Optimization (T-PPO), a novel extension of PPO that improves training efficiency by streamlining policy updates and length-restricted response generation. T-PPO mitigates the low hardware utilization inherent in fully synchronized long-generation procedures, where resources often sit idle while waiting for complete rollouts. Our contributions are twofold. First, we propose Extended Generalized Advantage Estimation (EGAE), which derives advantage estimates from incomplete responses while maintaining the integrity of policy learning. Second, we devise a computationally optimized mechanism that allows the policy and value models to be optimized independently. By selectively filtering prompt and truncated tokens, this mechanism reduces redundant computation and accelerates training without sacrificing convergence performance. We demonstrate the effectiveness and efficiency of T-PPO on AIME 2024 with a 32B base model. The experimental results show that T-PPO improves the training efficiency of reasoning LLMs by up to 2.5x and outperforms existing competitors.
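The abstract does not spell out the EGAE formula. As a rough illustrative sketch of the underlying idea (not the paper's implementation), the snippet below computes GAE-style advantages for a truncated response by bootstrapping from the critic's value estimate at the truncation boundary instead of treating the cut-off as a terminal state. The function name `truncated_gae` and its parameters are assumptions introduced for illustration.

```python
import numpy as np

def truncated_gae(rewards, values, bootstrap_value, gamma=1.0, lam=0.95):
    """GAE over a truncated (incomplete) response.

    rewards:         per-token rewards for the generated prefix, shape (T,)
    values:          critic estimates V(s_0), ..., V(s_{T-1}), shape (T,)
    bootstrap_value: V(s_T) at the truncation boundary, used in place of a
                     terminal value of zero so the unfinished suffix is not
                     treated as if the episode had ended.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    next_value = bootstrap_value
    # Standard backward GAE recursion; the only change for truncation is
    # seeding next_value with the bootstrap estimate at the cut-off point.
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + values  # regression targets for the value model
    return advantages, returns
```

Under this reading, bootstrapping at the truncation point is what lets policy updates proceed from partial rollouts, while the returned targets keep the value model's training consistent with the policy's.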