

Prometheus: Inducing Fine-grained Evaluation Capability in Language Models

October 12, 2023
Authors: Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, Minjoon Seo
cs.AI

Abstract

Recently, using a powerful proprietary Large Language Model (LLM) (e.g., GPT-4) as an evaluator for long-form responses has become the de facto standard. However, for practitioners with large-scale evaluation tasks and custom evaluation criteria (e.g., child-readability), using proprietary LLMs as an evaluator is unreliable due to their closed-source nature, uncontrolled versioning, and prohibitive costs. In this work, we propose Prometheus, a fully open-source LLM that is on par with GPT-4's evaluation capabilities when accompanied by the appropriate reference materials (a reference answer and a score rubric). We first construct the Feedback Collection, a new dataset that consists of 1K fine-grained score rubrics, 20K instructions, and 100K responses and language feedback generated by GPT-4. Using the Feedback Collection, we train Prometheus, a 13B evaluator LLM that can assess any given long-form text based on a customized score rubric provided by the user. Experimental results show that Prometheus scores a Pearson correlation of 0.897 with human evaluators when evaluating with 45 customized score rubrics, which is on par with GPT-4 (0.882), and greatly outperforms ChatGPT (0.392). Furthermore, measuring correlation with GPT-4 using 1,222 customized score rubrics across four benchmarks (MT Bench, Vicuna Bench, Feedback Bench, Flask Eval) shows similar trends, bolstering Prometheus's capability as an evaluator LLM. Lastly, Prometheus achieves the highest accuracy on two human preference benchmarks (HHH Alignment & MT Bench Human Judgment) compared to open-source reward models explicitly trained on human preference datasets, highlighting its potential as a universal reward model. We open-source our code, dataset, and model at https://github.com/kaistAI/Prometheus.
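
To make the evaluation setup concrete, below is a minimal Python sketch of how an evaluator LLM could be prompted with a user-provided score rubric and reference answer, and how its scores could be compared against human ratings via Pearson correlation, as reported in the abstract. The prompt template, the 1-5 score scale, the "Score: N" output convention, and the placeholder data are illustrative assumptions for this sketch; they are not the exact prompts or formats used by Prometheus (see the linked repository for those).

```python
# Minimal sketch of rubric-based evaluation and correlation with human judges.
# Assumptions: a hypothetical prompt template, a 1-5 score scale, and a
# "Score: N" output convention; none of these are taken from the paper.

import re
from scipy.stats import pearsonr

PROMPT_TEMPLATE = """You are an evaluator. Given an instruction, a response,
a reference answer, and a score rubric, write feedback and then output an
integer score from 1 to 5 on the final line in the form "Score: N".

Instruction: {instruction}
Response to evaluate: {response}
Reference answer: {reference_answer}
Score rubric: {rubric}
"""


def build_eval_prompt(instruction, response, reference_answer, rubric):
    """Assemble one evaluation prompt from the reference materials."""
    return PROMPT_TEMPLATE.format(
        instruction=instruction,
        response=response,
        reference_answer=reference_answer,
        rubric=rubric,
    )


def parse_score(feedback_text):
    """Extract the integer score from the evaluator's output, if present."""
    match = re.search(r"Score:\s*([1-5])", feedback_text)
    return int(match.group(1)) if match else None


def correlate_with_humans(model_scores, human_scores):
    """Pearson correlation between evaluator scores and human ratings."""
    r, _ = pearsonr(model_scores, human_scores)
    return r


if __name__ == "__main__":
    # Placeholder paired scores on the same set of responses; in practice these
    # would come from the evaluator LLM and from human annotators.
    model_scores = [5, 3, 4, 2, 5, 1]
    human_scores = [5, 3, 5, 2, 4, 1]
    print(f"Pearson r = {correlate_with_humans(model_scores, human_scores):.3f}")
```

A correlation near 0.9 on such paired scores would correspond to the level of human agreement the abstract reports for Prometheus (0.897) and GPT-4 (0.882), versus 0.392 for ChatGPT.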