Prometheus: Inducing Fine-grained Evaluation Capability in Language Models

Abstract

Recently, using a powerful proprietary Large Language Model (LLM) such as GPT-4 as an evaluator for long-form responses has become the de facto standard. However, for practitioners who have large-scale evaluation tasks or custom criteria in mind (e.g., child-readability), proprietary LLMs are unreliable as evaluators because of their closed-source nature, uncontrolled versioning, and prohibitive costs. In this work, we propose Prometheus, a fully open-source LLM that is on par with GPT-4's evaluation capabilities when the appropriate reference materials (a reference answer and a score rubric) are provided. We first construct the Feedback Collection, a new dataset that consists of 1K fine-grained score rubrics, 20K instructions, and 100K responses and language feedback generated by GPT-4. Using the Feedback Collection, we train Prometheus, a 13B evaluator LLM that can assess any given long-form text based on a customized score rubric provided by the user. Experimental results show that Prometheus achieves a Pearson correlation of 0.897 with human evaluators when evaluating with 45 customized score rubrics, which is on par with GPT-4 (0.882) and far higher than ChatGPT (0.392). Furthermore, measuring the correlation with GPT-4 using 1222 customized score rubrics across four benchmarks (MT Bench, Vicuna Bench, Feedback Bench, Flask Eval) shows similar trends, bolstering Prometheus's capability as an evaluator LLM. Lastly, Prometheus achieves the highest accuracy on two human preference benchmarks (HHH Alignment and MT Bench Human Judgment) compared to open-source reward models explicitly trained on human preference datasets, highlighting its potential as a universal reward model. We open-source our code, dataset, and model at https://github.com/kaistAI/Prometheus.
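The agreement figures quoted above are Pearson correlations between an evaluator's scores and human scores for the same responses. As a minimal sketch of that metric only, the snippet below computes the coefficient for two score lists. The function name `pearson`, the 1-5 integer scale, and the toy scores are assumptions made for illustration; they are not data or code from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (numerator) and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Toy scores on an assumed 1-5 scale, invented for illustration only.
human_scores = [1, 2, 3, 4, 5, 3]
evaluator_scores = [1, 3, 3, 4, 5, 2]
r = pearson(human_scores, evaluator_scores)
print(f"Pearson r = {r:.3f}")
```

A value near 1 means the evaluator ranks and spaces responses much as the human raters do. A value near 0 means the two sets of scores are essentially unrelated.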
