Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

Abstract

The effectiveness of Reinforcement Learning (RL) in Large Language Models (LLMs) depends on the nature and diversity of the data used before and during RL. In particular, reasoning problems can often be solved in multiple ways that rely on different forms of reasoning, and exposure to only a limited range of such approaches in the training data may limit the effectiveness of RL. Motivated by this, we investigate the use of diverse self-generated data during mid-training, an intermediate stage before RL training. Specifically, we adopt a bootstrapped data-generation framework guided by George Polya's problem-solving approaches to generate multiple variants of correct solutions for each question in the training data, and then fine-tune on them. We first provide a theoretical perspective on how mid-training on such data improves RL and explain how policy-gradient updates can incentivize combining multiple approaches. We then show empirically that RL-trained models initialized from our mid-training data achieve consistent improvements across several mathematical reasoning benchmarks and on out-of-distribution (OOD) tasks such as code generation and narrative reasoning. Overall, our study shows that a language model that learns multiple problem-solving approaches through self-generated data benefits subsequent RL.
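The data-generation step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the strategy prompts, the `generate` callable (a stand-in for sampling from the model itself), the `Answer:` convention for extracting final answers, and the function names are all assumptions made for illustration. The sketch samples solutions under several Polya-inspired strategy hints, keeps only those whose final answer matches the reference, and drops exact duplicates.

```python
from typing import Callable, Dict, List

# Hypothetical strategy hints loosely inspired by Polya's heuristics.
# The prompts actually used in the paper are not given in the abstract.
STRATEGIES = [
    "Work backward from the goal.",
    "Solve a simpler related problem first.",
    "Introduce suitable notation and set up an equation.",
]


def extract_answer(solution: str) -> str:
    """Return the text after the last 'Answer:' marker (assumed convention)."""
    if "Answer:" not in solution:
        return ""
    return solution.rsplit("Answer:", 1)[-1].strip()


def build_midtraining_set(
    problems: List[Dict[str, str]],
    generate: Callable[[str], str],
    samples_per_strategy: int = 2,
) -> List[Dict[str, str]]:
    """Collect multiple distinct, correct solutions per question."""
    dataset = []
    for problem in problems:
        seen = set()  # exact-duplicate filter, reset for each question
        for strategy in STRATEGIES:
            prompt = f"{problem['question']}\nApproach: {strategy}"
            for _ in range(samples_per_strategy):
                solution = generate(prompt)
                # Keep only solutions whose final answer matches the reference.
                if extract_answer(solution) != problem["answer"]:
                    continue
                if solution in seen:
                    continue
                seen.add(solution)
                dataset.append(
                    {
                        "question": problem["question"],
                        "strategy": strategy,
                        "solution": solution,
                    }
                )
    return dataset
```

In this sketch the resulting list would serve as the fine-tuning set for the mid-training stage; in practice `generate` would sample from the language model being trained, and the correctness check would likely need a more robust answer comparison than exact string matching.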