Prometheus: Inducing Fine-grained Evaluation Capability in Language Models
October 12, 2023
Authors: Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, Minjoon Seo
cs.AI
Abstract
Recently, using a powerful proprietary Large Language Model (LLM) (e.g.,
GPT-4) as an evaluator for long-form responses has become the de facto
standard. However, for practitioners with large-scale evaluation tasks and
custom criteria in consideration (e.g., child-readability), using proprietary
LLMs as an evaluator is unreliable due to the closed-source nature,
uncontrolled versioning, and prohibitive costs. In this work, we propose
Prometheus, a fully open-source LLM that is on par with GPT-4's evaluation
capabilities when the appropriate reference materials (reference answer, score
rubric) are accompanied. We first construct the Feedback Collection, a new
dataset that consists of 1K fine-grained score rubrics, 20K instructions, and
100K responses and language feedback generated by GPT-4. Using the Feedback
Collection, we train Prometheus, a 13B evaluator LLM that can assess any given
long-form text based on a customized score rubric provided by the user.
Experimental results show that Prometheus scores a Pearson correlation of 0.897
with human evaluators when evaluating with 45 customized score rubrics, which
is on par with GPT-4 (0.882), and greatly outperforms ChatGPT (0.392).
Furthermore, measuring correlation with GPT-4 with 1222 customized score
rubrics across four benchmarks (MT Bench, Vicuna Bench, Feedback Bench, Flask
Eval) shows similar trends, bolstering Prometheus's capability as an evaluator
LLM. Lastly, Prometheus achieves the highest accuracy on two human preference
benchmarks (HHH Alignment & MT Bench Human Judgment) compared to open-source
reward models explicitly trained on human preference datasets, highlighting its
potential as a universal reward model. We open-source our code, dataset, and
model at https://github.com/kaistAI/Prometheus.
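
Below is a minimal sketch, not taken from the paper, of how an evaluator LLM such as Prometheus might be prompted with an instruction, a response to evaluate, a reference answer, and a custom score rubric. The checkpoint name and the prompt layout are assumptions made for illustration; the official format and released weights are documented in the linked repository.

```python
# Hedged sketch: prompting an evaluator LLM with a custom score rubric.
# The model ID and prompt sections below are assumptions, not the paper's
# verbatim format; see https://github.com/kaistAI/Prometheus for the
# official template and released checkpoints.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "kaistAI/prometheus-13b-v1.0"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

# The evaluator is asked to write feedback first, then an integer score,
# conditioned on a user-defined rubric (here: child-readability).
prompt = (
    "###Task Description:\n"
    "Given an instruction, a response, a reference answer, and a score rubric, "
    "write detailed feedback and then assign an integer score from 1 to 5.\n\n"
    "###Instruction:\nExplain photosynthesis to a seven-year-old.\n\n"
    "###Response to evaluate:\nPlants eat sunlight and air to make their food.\n\n"
    "###Reference Answer:\nPlants use sunlight, water, and air in their green "
    "leaves to make sugar, which is their food.\n\n"
    "###Score Rubric (child-readability):\n"
    "1: too hard for a child to follow ... 5: clear, simple, and engaging for a child\n\n"
    "###Feedback:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Print only the newly generated feedback and score, not the echoed prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```

In this setup the rubric is an ordinary part of the prompt, which is what lets a practitioner swap in any custom criterion (e.g., child-readability) without retraining the evaluator.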