Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models

Abstract

Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from various LMs. However, concerns about transparency, controllability, and affordability strongly motivate the development of open-source LMs specialized in evaluation. Existing open evaluator LMs, for their part, exhibit critical shortcomings: 1) they issue scores that diverge significantly from those assigned by humans, and 2) they lack the flexibility to perform both direct assessment and pairwise ranking, the two most prevalent forms of assessment. Additionally, they cannot evaluate against custom evaluation criteria, focusing instead on general attributes such as helpfulness and harmlessness. To address these issues, we introduce Prometheus 2, a more powerful evaluator LM than its predecessor that closely mirrors human and GPT-4 judgments. Moreover, it can process both direct assessment and pairwise ranking formats paired with user-defined evaluation criteria. On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 achieves the highest correlation and agreement with humans and proprietary LM judges among all tested open evaluator LMs. Our models, code, and data are all publicly available at https://github.com/prometheus-eval/prometheus-eval.
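To make the two reported quantities concrete, the following is a minimal sketch of how correlation (for direct assessment scores) and agreement (for pairwise ranking verdicts) between an evaluator LM and human judges could be computed. The abstract does not say which statistics are used, so the choice of Pearson correlation and a simple match rate is an assumption, and the function names and sample data are purely illustrative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists.

    Assumption: Pearson is used only as an illustration; the abstract
    does not name a specific correlation statistic.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def agreement(verdicts_a, verdicts_b):
    """Fraction of items on which two judges pick the same response."""
    if len(verdicts_a) != len(verdicts_b) or not verdicts_a:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(a == b for a, b in zip(verdicts_a, verdicts_b))
    return matches / len(verdicts_a)


if __name__ == "__main__":
    # Hypothetical direct assessment scores on a 1-5 scale.
    human_scores = [5, 3, 4, 1, 2]
    evaluator_scores = [4, 3, 5, 1, 2]
    print("correlation:", pearson(human_scores, evaluator_scores))

    # Hypothetical pairwise verdicts: which of two responses is better.
    human_verdicts = ["A", "B", "A", "B"]
    evaluator_verdicts = ["A", "B", "B", "B"]
    print("agreement:", agreement(human_verdicts, evaluator_verdicts))
```

Under these assumptions, a higher correlation means the evaluator's scores track human scores more closely, and a higher agreement means it picks the same preferred response as humans more often.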
