Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
May 8, 2026
Authors: Aswin RRV, Jacob Dineen, Divij Handa, Mihir Parmar, Ben Zhou, Swaroop Mishra, Chitta Baral
cs.AI
Abstract
The effectiveness of Reinforcement Learning (RL) in Large Language Models (LLMs) depends on the nature and diversity of the data used before and during RL training. In particular, reasoning problems can often be approached in multiple ways that rely on different forms of reasoning, and exposure to only a limited range of such approaches in the training data may limit the effectiveness of RL. Motivated by this, we investigate using diverse self-generated data during mid-training as an intermediate step before RL training. Specifically, we adopt a bootstrapped data-generation framework, guided by George Polya's problem-solving approaches, to generate multiple variants of correct answers for each question in the training data, and then fine-tune on them. We first provide a theoretical perspective on how mid-training on such data improves RL and explain how policy-gradient updates can incentivize combining multiple approaches. We then empirically demonstrate that RL-trained models initialized with our mid-training data achieve consistent improvements across various mathematical reasoning benchmarks and on out-of-distribution (OOD) tasks such as code generation and narrative reasoning. Overall, our study shows that having a language model learn multiple problem-solving approaches through self-generated data helps subsequent RL.
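The data-generation step described above can be pictured with a minimal sketch. The abstract does not give the actual prompts, sampling procedure, or correctness check, so everything named here is an assumption for illustration: the `STRATEGIES` list (hints loosely inspired by Polya's "How to Solve It"), the `generate` callable standing in for sampling from the model itself, and exact-match filtering on the final answer.

```python
from typing import Callable, Dict, List

# Illustrative strategy hints loosely inspired by Polya's "How to Solve It".
# The paper's actual prompts are not given in the abstract; these are assumptions.
STRATEGIES: List[str] = [
    "work backward from the goal",
    "solve a simpler related problem first",
    "look for a pattern",
]


def build_midtraining_set(
    problems: List[Dict[str, str]],
    generate: Callable[[str, str], Dict[str, str]],
    strategies: List[str] = STRATEGIES,
) -> List[Dict[str, str]]:
    """Collect one correct solution per (problem, strategy) pair.

    `generate(question, strategy)` is a hypothetical stand-in for sampling
    from the model itself; it must return {"solution": ..., "answer": ...}.
    """
    dataset = []
    for problem in problems:
        for strategy in strategies:
            sample = generate(problem["question"], strategy)
            # Keep only variants whose final answer matches the reference.
            if sample["answer"].strip() == problem["answer"].strip():
                dataset.append(
                    {
                        "question": problem["question"],
                        "strategy": strategy,
                        "solution": sample["solution"],
                    }
                )
    return dataset


def toy_generate(question: str, strategy: str) -> Dict[str, str]:
    """Deterministic stub used only to exercise the filter (not a real model)."""
    # Pretend the "pattern" strategy produces a wrong answer.
    answer = "5" if "pattern" in strategy else "4"
    return {"solution": f"[{strategy}] reasoning for: {question}", "answer": answer}


problems = [{"question": "What is 2 + 2?", "answer": "4"}]
midtraining_data = build_midtraining_set(problems, toy_generate)
```

In this toy run, the two strategies whose stub answer matches the reference are kept and the third is filtered out. In a real pipeline, the retained (question, solution) pairs would presumably serve as the fine-tuning data for the mid-training stage that precedes RL.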