

Teaching Large Language Models to Reason with Reinforcement Learning

March 7, 2024
Authors: Alex Havrilla, Yuqing Du, Sharath Chandra Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, Sainbayar Sukhbaatar, Roberta Raileanu
cs.AI

Abstract

Reinforcement Learning from Human Feedback (RLHF) has emerged as a dominant approach for aligning LLM outputs with human preferences. Inspired by the success of RLHF, we study the performance of multiple algorithms that learn from feedback (Expert Iteration, Proximal Policy Optimization (PPO), Return-Conditioned RL) on improving LLM reasoning capabilities. We investigate both sparse and dense rewards provided to the LLM both heuristically and via a learned reward model. We additionally start from multiple model sizes and initializations both with and without supervised fine-tuning (SFT) data. Overall, we find all algorithms perform comparably, with Expert Iteration performing best in most cases. Surprisingly, we find the sample complexity of Expert Iteration is similar to that of PPO, requiring at most on the order of 10^6 samples to converge from a pretrained checkpoint. We investigate why this is the case, concluding that during RL training models fail to explore significantly beyond solutions already produced by SFT models. Additionally, we discuss a trade-off between maj@1 and pass@96 metric performance during SFT training and how conversely RL training improves both simultaneously. We then conclude by discussing the implications of our findings for RLHF and the future role of RL in LLM fine-tuning.
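
The abstract contrasts maj@1 and pass@96 performance. The sketch below is a minimal illustration of how such metrics are commonly computed, assuming maj@k means majority voting over k sampled final answers (maj@1 reducing to single-sample accuracy) and pass@k means at least one of k samples is correct; the helper names and toy data are illustrative, not taken from the paper.

```python
from collections import Counter

def maj_at_k(sampled_answers, reference, k):
    """Majority-vote accuracy: 1 if the most common of the first k
    sampled answers matches the reference answer, else 0."""
    votes = Counter(sampled_answers[:k])
    majority_answer, _ = votes.most_common(1)[0]
    return int(majority_answer == reference)

def pass_at_k(sampled_answers, reference, k):
    """pass@k: 1 if any of the first k sampled answers is correct."""
    return int(any(a == reference for a in sampled_answers[:k]))

# Toy example: 96 sampled final answers for one math problem (hypothetical data).
samples = ["42"] * 30 + ["41"] * 60 + ["40"] * 6
print(maj_at_k(samples, "42", k=96))   # 0 -- the majority answer is wrong
print(pass_at_k(samples, "42", k=96))  # 1 -- at least one sample is correct
```

Under these assumed definitions, the toy example also shows the tension the abstract refers to: samples that concentrate on a single (possibly wrong) answer can do well on majority-vote metrics while lacking the diversity that pass@96 rewards.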