Self-Taught Evaluators
August 5, 2024
Authors: Tianlu Wang, Ilia Kulikov, Olga Golovneva, Ping Yu, Weizhe Yuan, Jane Dwivedi-Yu, Richard Yuanzhe Pang, Maryam Fazel-Zarandi, Jason Weston, Xian Li
cs.AI
Abstract
Model-based evaluation is at the heart of successful model development -- as a reward model for training, and as a replacement for human evaluation. To train such evaluators, the standard approach is to collect a large amount of human preference judgments over model responses, which is costly and the data becomes stale as models improve. In this work, we present an approach that aims to improve evaluators without human annotations, using synthetic training data only. Starting from unlabeled instructions, our iterative self-improvement scheme generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces and final judgments, repeating this training at each new iteration using the improved predictions. Without any labeled preference data, our Self-Taught Evaluator can improve a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (88.7 with majority vote) on RewardBench. This outperforms commonly used LLM judges such as GPT-4 and matches the performance of the top-performing reward models trained with labeled examples.