Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning

Abstract

We explore a method for improving the performance of large language models through self-reflection and reinforcement learning. By incentivizing the model to generate better self-reflections when it answers incorrectly, we demonstrate that a model's ability to solve complex, verifiable tasks can be enhanced even when generating synthetic data is infeasible and only binary feedback is available. Our framework operates in two stages: first, upon failing a given task, the model generates a self-reflective commentary analyzing its previous attempt; second, the model is given another attempt at the task with the self-reflection in context. If the subsequent attempt succeeds, the tokens generated during the self-reflection phase are rewarded. Our experimental results show substantial performance gains across a variety of model architectures, as high as 34.7% improvement at math equation writing and 18.1% improvement at function calling. Notably, smaller fine-tuned models (1.5 billion to 7 billion parameters) outperform models in the same family that are 10 times larger. Our novel paradigm is thus an exciting pathway to more useful and reliable language models that can self-improve on challenging tasks with limited external feedback.
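
The two-stage loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `generate` and `verify` callables, the prompt wording, the `Episode` record, and the 1.0/0.0 reward values are all assumptions introduced here. The abstract only states that reflection tokens are rewarded when the retry succeeds.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical interfaces (assumptions, not from the paper):
#   generate(prompt) -> model output text
#   verify(task, answer) -> True/False, the binary feedback signal
Generate = Callable[[str], str]
Verify = Callable[[str, str], bool]


@dataclass
class Episode:
    first_answer: str
    reflection: Optional[str]    # None when the first attempt already succeeded
    retry_answer: Optional[str]  # None when no retry was needed
    reflection_reward: float     # credit assigned to the reflection tokens only


def reflect_retry_reward(task: str, generate: Generate, verify: Verify) -> Episode:
    first = generate(task)
    if verify(task, first):
        # No failure, so no reflection is generated and nothing is rewarded.
        return Episode(first, None, None, 0.0)

    # Stage 1: the model comments on its own failed attempt.
    reflection = generate(
        f"{task}\n\nPrevious attempt:\n{first}\n\n"
        "That attempt failed. Reflect on what went wrong."
    )

    # Stage 2: the model retries with the reflection in context.
    retry = generate(f"{task}\n\nReflection:\n{reflection}\n\nTry again.")

    # Reward the reflection only if the retry passes the binary check.
    reward = 1.0 if verify(task, retry) else 0.0
    return Episode(first, reflection, retry, reward)


# Toy stand-ins for a model and a verifier, for illustration only.
def toy_generate(prompt: str) -> str:
    if "Try again" in prompt:
        return "4"
    if "Reflect on what went wrong" in prompt:
        return "I miscounted; 2 + 2 is 4."
    return "5"


def toy_verify(task: str, answer: str) -> bool:
    return answer == "4"


episode = reflect_retry_reward("What is 2 + 2?", toy_generate, toy_verify)
print(episode.reflection_reward)
```

In a training setup, `reflection_reward` would be attached only to the tokens of `reflection`, so the policy update favors reflections that lead to a successful retry rather than task-specific answers.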
