Training Language Models to Self-Correct via Reinforcement Learning

Abstract

Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Existing approaches to training self-correction either require multiple models or rely on a more capable model or other forms of supervision. To address this, we develop SCoRe, a multi-turn online reinforcement learning (RL) approach that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are insufficient for instilling self-correction behavior. In particular, we observe that training via SFT either suffers from a distribution mismatch between the training data and the model's own responses or implicitly prefers only a certain mode of correction behavior that is often ineffective at test time. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and by using appropriate regularization to steer learning toward a self-correction strategy that is effective at test time, rather than simply fitting high-reward responses for a given prompt. This regularization prescribes first running a phase of RL on the base model to produce a policy initialization that is less susceptible to collapse, and then using a reward bonus to amplify self-correction during training. When applied to Gemini 1.0 Pro and 1.5 Flash models, SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% on the MATH and HumanEval benchmarks, respectively.
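
The abstract does not spell out the form of the reward bonus. As an illustration only, the sketch below assumes one plausible shaping: the second attempt's reward plus a term proportional to its improvement over the first attempt. The function name `shaped_second_turn_reward`, the weight `alpha`, and the 0/1 correctness rewards are hypothetical choices made for this example, not details taken from the source.

```python
def shaped_second_turn_reward(r_first: float, r_second: float, alpha: float = 1.0) -> float:
    """Return the second-attempt reward plus an improvement bonus.

    Assumption (not from the abstract): the bonus is alpha * (r_second - r_first),
    so turning a wrong answer into a right one is rewarded more than simply
    repeating a right answer, and breaking a right answer is penalized.
    """
    return r_second + alpha * (r_second - r_first)


# Hypothetical 0/1 correctness rewards for (first attempt, second attempt).
cases = {
    "incorrect -> correct": (0.0, 1.0),
    "correct -> correct": (1.0, 1.0),
    "correct -> incorrect": (1.0, 0.0),
    "incorrect -> incorrect": (0.0, 0.0),
}

shaped = {
    name: shaped_second_turn_reward(r1, r2, alpha=2.0)
    for name, (r1, r2) in cases.items()
}
```

Under this assumed shaping, a policy that only copies its first answer gains no bonus, which is one way a bonus could discourage collapse into a non-correcting mode.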

