Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
May 2, 2024
Authors: Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo
cs.AI
Abstract
Proprietary LMs such as GPT-4 are often employed to assess the quality of
responses from various LMs. However, concerns including transparency,
controllability, and affordability strongly motivate the development of
open-source LMs specialized in evaluations. On the other hand, existing open
evaluator LMs exhibit critical shortcomings: 1) they issue scores that
significantly diverge from those assigned by humans, and 2) they lack the
flexibility to perform both direct assessment and pairwise ranking, the two
most prevalent forms of assessment. Additionally, they do not possess the
ability to evaluate based on custom evaluation criteria, focusing instead on
general attributes like helpfulness and harmlessness. To address these issues,
we introduce Prometheus 2, a more powerful evaluator LM than its predecessor
that closely mirrors human and GPT-4 judgements. Moreover, it is capable of
processing both direct assessment and pairwise ranking formats, together with
user-defined evaluation criteria. On four direct assessment benchmarks and four
pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and
agreement with humans and proprietary LM judges among all tested open evaluator
LMs. Our models, code, and data are all publicly available at
https://github.com/prometheus-eval/prometheus-eval.
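The two evaluation formats the abstract describes can be sketched as prompt templates. The templates below are simplified illustrations under assumed field names, not the exact prompts shipped with Prometheus 2 (those live in the linked repository):

```python
# Illustrative sketch of the two evaluation formats: direct assessment
# (score one response against a custom rubric) and pairwise ranking
# (pick the better of two responses against the same rubric).
# NOTE: these templates and section headers are assumptions for
# illustration, not the official Prometheus 2 prompts.

def direct_assessment_prompt(instruction: str, response: str, rubric: str) -> str:
    """Build a prompt asking the evaluator for an integer score from 1 to 5."""
    return (
        "###Task Description:\n"
        "Evaluate the response below against the score rubric and give an "
        "integer score from 1 to 5.\n\n"
        f"###Instruction:\n{instruction}\n\n"
        f"###Response:\n{response}\n\n"
        f"###Score Rubric:\n{rubric}\n"
    )

def pairwise_ranking_prompt(instruction: str, response_a: str,
                            response_b: str, rubric: str) -> str:
    """Build a prompt asking the evaluator to choose 'A' or 'B'."""
    return (
        "###Task Description:\n"
        "Compare Response A and Response B against the score rubric and "
        "answer with 'A' or 'B'.\n\n"
        f"###Instruction:\n{instruction}\n\n"
        f"###Response A:\n{response_a}\n\n"
        f"###Response B:\n{response_b}\n\n"
        f"###Score Rubric:\n{rubric}\n"
    )

# Either prompt would then be sent to the evaluator LM; a custom rubric
# replaces fixed attributes like helpfulness or harmlessness.
prompt = direct_assessment_prompt(
    "Explain photosynthesis to a child.",
    "Plants use sunlight to make their own food.",
    "Is the explanation accurate yet age-appropriate?",
)
```

The key point the abstract makes is that both formats share the same user-defined rubric slot, so one model can serve both assessment styles.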