Self-Taught Evaluators

Abstract

Model-based evaluation is at the heart of successful model development, both as a reward model for training and as a replacement for human evaluation. To train such evaluators, the standard approach is to collect a large number of human preference judgments over model responses, which is costly, and the data becomes stale as models improve. In this work, we present an approach that aims to improve evaluators without human annotations, using synthetic training data only. Starting from unlabeled instructions, our iterative self-improvement scheme generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces and final judgments, repeating this training at each new iteration using the improved predictions. Without any labeled preference data, our Self-Taught Evaluator can improve a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (88.7 with majority vote) on RewardBench. This outperforms commonly used LLM judges such as GPT-4 and matches the performance of the top-performing reward models trained with labeled examples.
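The iterative scheme and the majority-vote step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The helpers `generate_pair`, `sample_judgments`, and `finetune` are hypothetical callables supplied by the caller, and two details are assumptions that the abstract does not state: that each pair of contrasting outputs comes with a known preferred response, and that only judgments agreeing with it are kept for training.

```python
from collections import Counter
from typing import Any, Callable, Dict, List, Tuple

Judgment = Tuple[str, str]  # (reasoning trace, verdict), verdict is "A" or "B"


def majority_vote(verdicts: List[str]) -> str:
    """Return the most frequent verdict; ties go to the verdict seen first."""
    return Counter(verdicts).most_common(1)[0][0]


def self_taught_loop(
    judge: Any,
    instructions: List[str],
    generate_pair: Callable[[str], Tuple[str, str]],
    sample_judgments: Callable[[Any, str, str, str], List[Judgment]],
    finetune: Callable[[Any, List[Dict[str, str]]], Any],
    iterations: int = 2,
) -> Any:
    """Iteratively retrain a judge on its own filtered judgments (hypothetical helpers)."""
    for _ in range(iterations):
        examples: List[Dict[str, str]] = []
        for instruction in instructions:
            # Assumption: generate_pair returns (preferred, contrasting) responses,
            # so the correct verdict is known by construction, without human labels.
            preferred, contrasting = generate_pair(instruction)
            # The preferred response is always shown as "A" here for brevity;
            # a real setup would likely randomize positions.
            for trace, verdict in sample_judgments(judge, instruction, preferred, contrasting):
                if verdict == "A":  # assumed filter: keep only correct judgments
                    examples.append(
                        {"instruction": instruction, "trace": trace, "verdict": verdict}
                    )
                    break  # one kept judgment per instruction
        # The improved judge produces the judgments for the next iteration.
        judge = finetune(judge, examples)
    return judge
```

At evaluation time, `majority_vote` would be applied to the verdicts of several sampled judgments for the same response pair, which is one plausible reading of the "88.7 with majority vote" figure.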
