Teaching Large Language Models to Reason with Reinforcement Learning

Abstract

Reinforcement Learning from Human Feedback (RLHF) has emerged as a dominant approach for aligning LLM outputs with human preferences. Inspired by the success of RLHF, we study how well several algorithms that learn from feedback (Expert Iteration, Proximal Policy Optimization (PPO), and Return-Conditioned RL) improve LLM reasoning capabilities. We investigate sparse and dense rewards, provided to the LLM either heuristically or via a learned reward model. We also start from multiple model sizes and initializations, both with and without supervised fine-tuning (SFT) data. Overall, we find that all algorithms perform comparably, with Expert Iteration performing best in most cases. Surprisingly, the sample complexity of Expert Iteration is similar to that of PPO, requiring at most on the order of 10^6 samples to converge from a pretrained checkpoint. We investigate why this is the case and conclude that, during RL training, models fail to explore significantly beyond solutions already produced by SFT models. Additionally, we discuss a trade-off between maj@1 and pass@96 performance during SFT training and how, conversely, RL training improves both metrics simultaneously. We conclude by discussing the implications of our findings for RLHF and the future role of RL in LLM fine-tuning.
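The abstract does not define the maj@1 and pass@96 metrics. The sketch below assumes a common reading: pass@n counts a problem as solved if any of n sampled answers is correct, and maj@n scores the majority-vote answer among n samples, so that maj@1 reduces to the accuracy of a single sample. The function names and the exact-string comparison of final answers are illustrative assumptions, not taken from the paper.

```python
from collections import Counter
from typing import Callable, Sequence


def pass_at_n(samples: Sequence[str], reference: str) -> bool:
    """True if at least one sampled final answer matches the reference."""
    return any(answer == reference for answer in samples)


def maj_at_n(samples: Sequence[str], reference: str) -> bool:
    """True if the most frequent sampled answer matches the reference.

    Ties are broken in favor of the answer that was sampled first,
    which is how Counter.most_common orders equal counts.
    """
    if not samples:
        return False
    top_answer, _count = Counter(samples).most_common(1)[0]
    return top_answer == reference


def mean_metric(
    metric: Callable[[Sequence[str], str], bool],
    all_samples: Sequence[Sequence[str]],
    references: Sequence[str],
) -> float:
    """Average a per-problem metric over a (hypothetical) evaluation set."""
    hits = [metric(s, r) for s, r in zip(all_samples, references)]
    return sum(hits) / len(hits)
```

Under these definitions pass@n can only rise as n grows, whereas maj@n depends on how concentrated the samples are on the correct answer, which is one way the two metrics can move in opposite directions.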
