

Self-Rewarding Language Models

January 18, 2024
Authors: Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, Jason Weston
cs.AI

Abstract

We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may be bottlenecked by human performance level; moreover, these separate frozen reward models cannot learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training, not only does instruction-following ability improve, but also the model's ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While only a preliminary study, this work opens the door to the possibility of models that can continually improve along both axes.
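To make the described loop concrete, below is a minimal illustrative sketch of one self-rewarding iteration: the same model samples candidate responses, scores them with an LLM-as-a-Judge style prompt, and the resulting preference pairs drive a DPO update. This is not the authors' implementation; the `model` interface (`generate`, `score`, `dpo_train`) and the judge prompt wording are hypothetical stand-ins.

```python
# Hypothetical sketch of one Self-Rewarding iteration.
# The `model` object and its generate/score/dpo_train methods are illustrative
# assumptions, not an API from the paper or its code release.

JUDGE_PROMPT = (
    "Review the user's question and the corresponding response, and award "
    "a score from 0 to 5 for helpfulness, relevance, and accuracy.\n"
    "Question: {prompt}\nResponse: {response}\nScore:"
)

def self_rewarding_iteration(model, prompts, n_candidates=4):
    """One iteration: generate candidates, self-judge, build pairs, DPO-train."""
    preference_pairs = []
    for prompt in prompts:
        # 1) Sample several candidate responses from the current model.
        candidates = [model.generate(prompt) for _ in range(n_candidates)]

        # 2) Score each candidate with the model itself acting as judge.
        scores = [
            model.score(JUDGE_PROMPT.format(prompt=prompt, response=c))
            for c in candidates
        ]

        # 3) Use the best- and worst-scored responses as a preference pair.
        best = candidates[scores.index(max(scores))]
        worst = candidates[scores.index(min(scores))]
        if max(scores) > min(scores):
            preference_pairs.append((prompt, best, worst))

    # 4) Update the model on its self-generated preference data with DPO;
    #    repeating this loop yields the iterations described in the abstract.
    return model.dpo_train(preference_pairs)
```

Because the judge is the model being trained, each iteration can, in principle, improve both the responses and the quality of the reward signal itself, which is the "both axes" claim in the abstract.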