Self-Rewarding Language Models
January 18, 2024
Authors: Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, Jason Weston
cs.AI
Abstract
We posit that to achieve superhuman agents, future models require superhuman
feedback in order to provide an adequate training signal. Current approaches
commonly train reward models from human preferences, which may then be
bottlenecked by human performance level; moreover, these separate frozen
reward models cannot learn to improve during LLM training. In this work,
we study Self-Rewarding Language Models, where the language model itself is
used via LLM-as-a-Judge prompting to provide its own rewards during training.
We show that during Iterative DPO training, not only does instruction
following ability improve, but also the ability to provide high-quality rewards
to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a
model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard,
including Claude 2, Gemini Pro, and GPT-4 0613. While only a preliminary study,
this work opens the door to the possibility of models that can continually
improve in both axes.
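
As a rough illustration of the loop the abstract describes, the sketch below builds DPO preference pairs in which the same model acts as both responder and reward model via an LLM-as-a-Judge score. The `generate` and `judge` callables, the `PreferencePair` type, and the candidate count are hypothetical stand-ins, not the authors' code; the paper's actual judging prompt and training setup are in the full text.

```python
# Minimal sketch of one Self-Rewarding iteration (assumptions, not the paper's code):
# the model M_t samples candidate responses, scores them itself with an
# LLM-as-a-Judge prompt, and the best/worst pair becomes a DPO preference pair.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # candidate the model itself scored highest
    rejected: str  # candidate the model itself scored lowest


def self_rewarding_iteration(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],  # hypothetical: M_t samples N candidates
    judge: Callable[[str, str], float],         # hypothetical: M_t scores a response as judge
    num_candidates: int = 4,
) -> List[PreferencePair]:
    """Build DPO preference pairs in which the model is its own reward model."""
    pairs: List[PreferencePair] = []
    for prompt in prompts:
        candidates = generate(prompt, num_candidates)
        scores = [judge(prompt, c) for c in candidates]
        best = max(range(len(candidates)), key=scores.__getitem__)
        worst = min(range(len(candidates)), key=scores.__getitem__)
        if scores[best] > scores[worst]:  # skip prompts with no clear preference
            pairs.append(PreferencePair(prompt, candidates[best], candidates[worst]))
    return pairs
```

Running DPO on these pairs and then repeating the procedure with the updated model as both generator and judge corresponds to the iterative training loop the abstract refers to.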