Self-Rewarding Language Models

Abstract

We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may then be bottlenecked by human performance level; moreover, these separate, frozen reward models cannot learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training, not only does instruction-following ability improve, but so does the model's ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. Although this is only a preliminary study, it opens the door to models that can continually improve on both axes: following instructions and rewarding their own outputs.
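To make the self-rewarding idea concrete, the sketch below shows one plausible way that judge scores could be turned into preference data for a DPO-style update. The abstract does not describe this step, so the pairing rule (highest-scored versus lowest-scored candidate, with ties skipped) is an assumption. The names `build_preference_pair` and `toy_judge` are hypothetical, and `toy_judge` is only a stand-in for the model scoring its own outputs via LLM-as-a-Judge prompting.

```python
from typing import Callable, List, Optional, Tuple


def build_preference_pair(
    prompt: str,
    candidates: List[str],
    judge: Callable[[str, str], float],
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) for one prompt, or None if no pair can be formed.

    Assumption: the highest- and lowest-scored candidates form the pair.
    """
    if len(candidates) < 2:
        return None
    # Score every candidate with the judge (in the self-rewarding setting,
    # the judge would be the same model that produced the candidates).
    scored = [(judge(prompt, c), c) for c in candidates]
    best = max(scored, key=lambda sc: sc[0])
    worst = min(scored, key=lambda sc: sc[0])
    if best[0] == worst[0]:
        # All scores tie: there is no preference signal, so skip this prompt.
        return None
    return best[1], worst[1]


def toy_judge(prompt: str, response: str) -> float:
    """Hypothetical stand-in judge: counts distinct whitespace-separated tokens."""
    return float(len(set(response.split())))


pair = build_preference_pair(
    "What is the capital of France?",
    ["Paris.", "The capital of France is Paris.", "I am not sure."],
    toy_judge,
)
print(pair)
```

In an iterative setup, pairs collected this way would form the preference dataset for the next round of training, and the updated model would then serve as both generator and judge in the following iteration.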