Mid-Training on Self-Generated Data Improves Reinforcement Learning in Language Models

Abstract

The effectiveness of Reinforcement Learning (RL) in Large Language Models (LLMs) depends on the nature and diversity of the data used before and during RL. In particular, reasoning problems can often be approached in multiple ways that rely on different forms of reasoning, and exposure to only a limited range of such approaches in the training data may limit the effectiveness of RL. Motivated by this, we investigate mid-training on diverse self-generated data as an intermediate step before RL training. Specifically, we adopt a bootstrapped data-generation framework, guided by George Polya's problem-solving approaches, to generate multiple variants of correct answers for each question in the training data, and then fine-tune on them. We first provide a theoretical perspective on how mid-training on such data improves RL and explain how policy-gradient updates can incentivize combining multiple approaches. We then empirically demonstrate that RL-trained models initialized from our mid-training data achieve consistent improvements across various mathematical reasoning benchmarks and on other out-of-distribution (OOD) tasks such as code generation and narrative reasoning. Overall, our study shows that having a language model learn multiple problem-solving approaches through self-generated data helps subsequent RL.