Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
May 2, 2024
Authors: Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo
cs.AI
Abstract
Proprietary LMs such as GPT-4 are often employed to assess the quality of
responses from various LMs. However, concerns including transparency,
controllability, and affordability strongly motivate the development of
open-source LMs specialized in evaluations. On the other hand, existing open
evaluator LMs exhibit critical shortcomings: 1) they issue scores that
significantly diverge from those assigned by humans, and 2) they lack the
flexibility to perform both direct assessment and pairwise ranking, the two
most prevalent forms of assessment. Additionally, they do not possess the
ability to evaluate based on custom evaluation criteria, focusing instead on
general attributes like helpfulness and harmlessness. To address these issues,
we introduce Prometheus 2, a more powerful evaluator LM than its predecessor
that closely mirrors human and GPT-4 judgements. Moreover, it is capable of
processing both direct assessment and pairwise ranking formats paired with
user-defined evaluation criteria. On four direct assessment benchmarks and four
pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and
agreement with humans and proprietary LM judges among all tested open evaluator
LMs. Our models, code, and data are all publicly available at
https://github.com/prometheus-eval/prometheus-eval.
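
The abstract reports two kinds of results: correlation with human scores for direct assessment, and agreement with human preferences for pairwise ranking. The following is a minimal sketch of how such metrics could be computed, not the authors' evaluation code. The abstract does not say which correlation statistic is used, so Pearson correlation is an assumption here, and the function names and sample data are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists.

    Assumption: Pearson is used only as an illustration; the abstract
    does not name the correlation statistic.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def pairwise_agreement(human_choices, judge_choices):
    """Fraction of response pairs where the judge picks the same winner."""
    if len(human_choices) != len(judge_choices) or not human_choices:
        raise ValueError("need two equal-length, non-empty lists")
    matches = sum(h == j for h, j in zip(human_choices, judge_choices))
    return matches / len(human_choices)


if __name__ == "__main__":
    # Hypothetical data: 1-5 scores for direct assessment,
    # "A"/"B" winners for pairwise ranking.
    human_scores = [1, 2, 3, 4, 5]
    judge_scores = [2, 2, 3, 5, 5]
    print("correlation:", pearson(human_scores, judge_scores))

    human_prefs = ["A", "B", "A", "B"]
    judge_prefs = ["A", "B", "B", "B"]
    print("agreement:", pairwise_agreement(human_prefs, judge_prefs))
```

In this framing, direct assessment yields a numeric score per response, so a correlation coefficient is a natural fit, while pairwise ranking yields a discrete choice per pair, so a simple match rate suffices.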