VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment

Abstract

Large language models (LLMs) are increasingly applied to complex reasoning tasks that require executing several complex steps before receiving any reward. Properly assigning credit to these steps is essential for enhancing model performance. Proximal Policy Optimization (PPO), a state-of-the-art reinforcement learning (RL) algorithm used for LLM finetuning, employs value networks to tackle credit assignment. However, value networks struggle to accurately predict the expected cumulative reward in complex reasoning tasks, often leading to high-variance updates and suboptimal performance. In this work, we systematically evaluate the efficacy of value networks and reveal their significant shortcomings in reasoning-heavy LLM tasks, showing that they barely outperform a random baseline when comparing alternative steps. To address this, we propose VinePPO, a straightforward approach that leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates, bypassing the need for large value networks. Our method consistently outperforms PPO and RL-free baselines across the MATH and GSM8K datasets with fewer gradient updates (up to 9x) and less wall-clock time (up to 3.0x). These results emphasize the importance of accurate credit assignment in RL finetuning of LLMs and demonstrate VinePPO's potential as a superior alternative.
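The Monte Carlo idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the value of an intermediate reasoning state can be estimated by averaging the rewards of several independent continuations sampled from that state, and that a per-step advantage can be taken as the difference between consecutive value estimates (an assumption made here for illustration, with no discounting and only a terminal reward). The names mc_value, toy_rollout, and step_advantages are hypothetical, and toy_rollout is a stand-in for sampling a completion from the policy and scoring it.

```python
import random
from typing import Callable, List


def mc_value(
    state: str,
    rollout: Callable[[str, random.Random], float],
    k: int,
    rng: random.Random,
) -> float:
    """Estimate the value of `state` as the mean return of k independent rollouts."""
    returns = [rollout(state, rng) for _ in range(k)]
    return sum(returns) / k


def toy_rollout(state: str, rng: random.Random) -> float:
    """Hypothetical stand-in for 'sample a completion from the policy, then score it'.

    Each 'g' in the prefix marks a good reasoning step and raises the chance
    that the final answer is correct (reward 1.0); otherwise the reward is 0.0.
    """
    p_success = min(1.0, 0.2 + 0.2 * state.count("g"))
    return 1.0 if rng.random() < p_success else 0.0


def step_advantages(prefixes: List[str], k: int, rng: random.Random) -> List[float]:
    """Assumed advantage of each step: V(next prefix) - V(current prefix)."""
    values = [mc_value(s, toy_rollout, k, rng) for s in prefixes]
    return [values[t + 1] - values[t] for t in range(len(values) - 1)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Prefixes of one trajectory: a good step, then a bad step, then a good step.
    prefixes = ["", "g", "gb", "gbg"]
    print(step_advantages(prefixes, k=500, rng=rng))
```

In this toy setting the bad step ("b") leaves the success probability unchanged, so its estimated advantage should be close to zero, while each good step should receive a positive advantage of roughly 0.2. Because each estimate is a plain sample mean, it is unbiased for the true expected reward of the toy rollout, and its variance shrinks as k grows.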
