Self-Rewarding Language Models

Abstract

We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which has two limitations: the reward models may be bottlenecked by human performance level, and, being separate and frozen, they cannot learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training, not only does instruction-following ability improve, but so does the model's ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While only a preliminary study, this work opens the door to the possibility of models that can continually improve in both axes, instruction following and reward quality.
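As a rough illustration of the self-rewarding idea, the sketch below shows how one model's own judgments could be turned into preference data for a DPO step. The abstract does not specify this procedure, so everything here is an assumption: `generate` and `judge` are hypothetical callables that would both be backed by the same language model, and the pairing rule (highest-scored response as chosen, lowest-scored as rejected, ties dropped) is one plausible choice rather than a description of the paper's implementation.

```python
from typing import Callable, Optional, Tuple


def build_preference_pair(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical: samples one response from the model
    judge: Callable[[str, str], float],  # hypothetical: same model scoring (prompt, response)
    num_candidates: int = 4,             # assumed value, not taken from the source
) -> Optional[Tuple[str, str, str]]:
    """Return (prompt, chosen, rejected), or None when all scores are tied."""
    # Sample several candidate responses for the same prompt.
    candidates = [generate(prompt) for _ in range(num_candidates)]
    # The model acts as its own judge and assigns each candidate a reward.
    scores = [judge(prompt, c) for c in candidates]

    best = max(range(num_candidates), key=scores.__getitem__)
    worst = min(range(num_candidates), key=scores.__getitem__)

    # A tie carries no preference signal, so the prompt is skipped.
    if scores[best] == scores[worst]:
        return None
    return prompt, candidates[best], candidates[worst]
```

Pairs built this way would feed one round of DPO training, and the resulting model would serve as both generator and judge in the next iteration.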