Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
May 2, 2024
Authors: Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo
cs.AI
Abstract
Proprietary LMs such as GPT-4 are often employed to assess the quality of
responses from various LMs. However, concerns including transparency,
controllability, and affordability strongly motivate the development of
open-source LMs specialized in evaluations. On the other hand, existing open
evaluator LMs exhibit critical shortcomings: 1) they issue scores that
significantly diverge from those assigned by humans, and 2) they lack the
flexibility to perform both direct assessment and pairwise ranking, the two
most prevalent forms of assessment. Additionally, they do not possess the
ability to evaluate based on custom evaluation criteria, focusing instead on
general attributes like helpfulness and harmlessness. To address these issues,
we introduce Prometheus 2, a more powerful evaluator LM than its predecessor
that closely mirrors human and GPT-4 judgements. Moreover, it is capable of
processing both direct assessment and pairwise ranking formats, together with
user-defined evaluation criteria. On four direct assessment benchmarks and four
pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and
agreement with humans and proprietary LM judges among all tested open evaluator
LMs. Our models, code, and data are all publicly available at
https://github.com/prometheus-eval/prometheus-eval.
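The two evaluation formats the abstract describes can be sketched as prompt templates. The templates below are simplified illustrations under assumed field names, not the exact prompts shipped with Prometheus 2 (those live in the linked repository):

```python
# Illustrative sketch of the two evaluation formats: direct assessment
# (score one response against a custom rubric) and pairwise ranking
# (pick the better of two responses against the same rubric).
# NOTE: these templates and section headers are assumptions for
# illustration, not the official Prometheus 2 prompts.

def direct_assessment_prompt(instruction: str, response: str, rubric: str) -> str:
    """Build a prompt asking the evaluator for an integer score from 1 to 5."""
    return (
        "###Task Description:\n"
        "Evaluate the response below against the score rubric and give an "
        "integer score from 1 to 5.\n\n"
        f"###Instruction:\n{instruction}\n\n"
        f"###Response:\n{response}\n\n"
        f"###Score Rubric:\n{rubric}\n"
    )

def pairwise_ranking_prompt(instruction: str, response_a: str,
                            response_b: str, rubric: str) -> str:
    """Build a prompt asking the evaluator to choose 'A' or 'B'."""
    return (
        "###Task Description:\n"
        "Compare Response A and Response B against the score rubric and "
        "answer with 'A' or 'B'.\n\n"
        f"###Instruction:\n{instruction}\n\n"
        f"###Response A:\n{response_a}\n\n"
        f"###Response B:\n{response_b}\n\n"
        f"###Score Rubric:\n{rubric}\n"
    )

# Either prompt would then be sent to the evaluator LM; a custom rubric
# replaces fixed attributes like helpfulness or harmlessness.
prompt = direct_assessment_prompt(
    "Explain photosynthesis to a child.",
    "Plants use sunlight to make their own food.",
    "Is the explanation accurate yet age-appropriate?",
)
```

The key point the abstract makes is that both formats share the same user-defined rubric slot, so one model can serve both assessment styles.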