Self-Taught Evaluators

Abstract

Model-based evaluation is at the heart of successful model development, serving both as a reward model for training and as a replacement for human evaluation. The standard approach to training such evaluators is to collect a large number of human preference judgments over model responses, which is costly, and the data becomes stale as models improve. In this work, we present an approach that aims to improve evaluators without human annotations, using synthetic training data only. Starting from unlabeled instructions, our iterative self-improvement scheme generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces and final judgments, repeating this training at each new iteration using the improved predictions. Without any labeled preference data, our Self-Taught Evaluator improves a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (88.7 with majority vote) on RewardBench. This outperforms commonly used LLM judges such as GPT-4 and matches the performance of the top-performing reward models trained with labeled examples.
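
The iterative scheme and the majority-vote variant described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: every name here (`make_pair`, `finetune`, `self_teach`, `majority_vote`) is hypothetical, the model calls are passed in as plain callables, and the assumptions that each contrasting pair carries a synthetic label known by construction and that only judgments agreeing with that label are kept for training are not stated in the abstract.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical types for illustration only.
Pair = Tuple[str, str, str]                  # (instruction, response_a, response_b)
Judgment = Tuple[str, str]                   # (reasoning_trace, verdict), verdict is "A" or "B"
Judge = Callable[[Pair], Judgment]
Example = Tuple[Pair, str, str]              # (pair, reasoning_trace, verdict)


def majority_vote(judge: Judge, pair: Pair, n_samples: int = 5) -> str:
    """Query the judge n_samples times and return the most common verdict."""
    verdicts = [judge(pair)[1] for _ in range(n_samples)]
    return Counter(verdicts).most_common(1)[0][0]


def self_teach(
    instructions: List[str],
    make_pair: Callable[[str], Tuple[Pair, str]],
    judge: Judge,
    finetune: Callable[[Judge, List[Example]], Judge],
    n_iterations: int = 3,
    samples_per_pair: int = 4,
) -> Judge:
    """Iteratively retrain a judge on its own filtered judgments.

    make_pair(instruction) is assumed to return a pair of contrasting
    responses plus a synthetic label ("A" or "B") for the better one.
    finetune(judge, examples) is assumed to return an updated judge.
    """
    for _ in range(n_iterations):
        examples: List[Example] = []
        for instruction in instructions:
            pair, synthetic_label = make_pair(instruction)
            # Sample judgments; keep the first one whose verdict matches
            # the synthetic label (assumed filtering rule).
            for _ in range(samples_per_pair):
                trace, verdict = judge(pair)
                if verdict == synthetic_label:
                    examples.append((pair, trace, verdict))
                    break
        # The improved judge produces the judgments for the next iteration.
        judge = finetune(judge, examples)
    return judge
```

In a real setting, `judge` and `finetune` would wrap an LLM and a training run; here they are left abstract so the control flow of the loop stays visible.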
