Self-Taught Evaluators
August 5, 2024
作者: Tianlu Wang, Ilia Kulikov, Olga Golovneva, Ping Yu, Weizhe Yuan, Jane Dwivedi-Yu, Richard Yuanzhe Pang, Maryam Fazel-Zarandi, Jason Weston, Xian Li
cs.AI
Abstract
Model-based evaluation is at the heart of successful model development -- as a reward model for training, and as a replacement for human evaluation. To train such evaluators, the standard approach is to collect a large amount of human preference judgments over model responses, which is costly and the data becomes stale as models improve. In this work, we present an approach that aims to improve evaluators without human annotations, using synthetic training data only. Starting from unlabeled instructions, our iterative self-improvement scheme generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces and final judgments, repeating this training at each new iteration using the improved predictions. Without any labeled preference data, our Self-Taught Evaluator can improve a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (88.7 with majority vote) on RewardBench. This outperforms commonly used LLM judges such as GPT-4 and matches the performance of the top-performing reward models trained with labeled examples.
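
The sketch below outlines the iterative self-improvement loop described in the abstract as Python pseudocode. It is a minimal illustration under stated assumptions, not the authors' released code: generate_contrasting_outputs, sample_judgment, and finetune_judge are hypothetical placeholder functions standing in for the generation, judging, and fine-tuning steps, and the filtering rule shown is one plausible way to reuse the judge's improved predictions for the next round of training.

```python
# Minimal sketch of the iterative self-improvement loop described in the
# abstract, not the authors' released code. generate_contrasting_outputs,
# sample_judgment, and finetune_judge are hypothetical placeholder functions.

def self_taught_evaluator(judge, instructions, num_iterations=3):
    """Iteratively improve an LLM-as-a-Judge using only synthetic preference data."""
    for _ in range(num_iterations):
        training_examples = []
        for instruction in instructions:
            # 1. Create a contrasting pair of responses (one intended to be
            #    preferred, one dispreferred) without any human labels.
            chosen, rejected = generate_contrasting_outputs(instruction)

            # 2. Ask the current judge for a reasoning trace and a verdict
            #    on which response is better.
            reasoning, verdict = sample_judgment(judge, instruction, chosen, rejected)

            # 3. Keep examples where the verdict agrees with the synthetic
            #    preference, so training reuses the judge's improved predictions.
            if verdict == "chosen":
                training_examples.append(
                    (instruction, chosen, rejected, reasoning, verdict)
                )

        # 4. Fine-tune the judge on its own filtered reasoning traces and
        #    verdicts, then repeat with the improved model.
        judge = finetune_judge(judge, training_examples)
    return judge
```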