
Training Language Models to Self-Correct via Reinforcement Learning

September 19, 2024
Authors: Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, Aleksandra Faust
cs.AI

Abstract

Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Existing approaches for training self-correction either require multiple models or rely on a more capable model or other forms of supervision. To this end, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are insufficient for instilling self-correction behavior. In particular, we observe that training via SFT either suffers from a distribution mismatch between the training data and the model's own responses or implicitly prefers only a certain mode of correction behavior that is often not effective at test time. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction strategy that is effective at test time as opposed to simply fitting high-reward responses for a given prompt. This regularization prescribes running a first phase of RL on a base model to generate a policy initialization that is less susceptible to collapse and then using a reward bonus to amplify self-correction during training. When applied to Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks.
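The abstract only sketches the reward design, so the following is a minimal illustrative sketch, not the paper's exact formulation: one way to read the "reward bonus to amplify self-correction" is as shaping the return of a two-turn (attempt, revision) episode so that genuinely fixing a wrong first attempt is rewarded beyond the correctness of the final answer alone. The function name, the 0/1 correctness rewards, and the coefficient `alpha` below are assumptions introduced for illustration.

```python
# Illustrative sketch only: a two-turn self-correction reward shaping scheme,
# loosely following the abstract's description of SCoRe's "reward bonus".
# The 0/1 correctness rewards and the coefficient `alpha` are assumptions,
# not the paper's exact formulation.

def shaped_two_turn_reward(first_attempt_correct: bool,
                           second_attempt_correct: bool,
                           alpha: float = 1.0) -> float:
    """Return a shaped reward for one (first attempt, self-corrected attempt) episode."""
    r1 = 1.0 if first_attempt_correct else 0.0
    r2 = 1.0 if second_attempt_correct else 0.0
    # Base reward: correctness of the final (self-corrected) answer.
    base = r2
    # Bonus term: amplify genuine self-correction (wrong -> right) and penalize
    # regressions (right -> wrong), so the policy cannot score well by simply
    # repeating its first answer or ignoring the first turn entirely.
    bonus = alpha * (r2 - r1)
    return base + bonus
```

In the method described above, a reward of this flavor would be optimized with multi-turn online RL on the model's own correction traces, after a first RL phase that initializes the policy so it is less prone to collapsing into trivially restating its first answer.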
