

Quantitative LLM Judges

June 3, 2025
作者: Aishwarya Sahoo, Jeevana Kruthi Karnuthala, Tushar Parmanand Budhwani, Pranchal Agarwal, Sankaran Vaidyanathan, Alexa Siu, Franck Dernoncourt, Jennifer Healey, Nedim Lipka, Ryan Rossi, Uttaran Bhattacharya, Branislav Kveton
cs.AI

Abstract

LLM-as-a-judge is a framework in which a large language model (LLM) automatically evaluates the output of another LLM. We propose quantitative LLM judges, which align evaluation scores of existing LLM judges to human scores in a given domain using regression models. The models are trained to improve the score of the original judge by using the judge's textual evaluation and score. We present four quantitative judges for different types of absolute and relative feedback, which showcases the generality and versatility of our framework. Our framework is more computationally efficient than supervised fine-tuning and can be more statistically efficient when human feedback is limited, which is expected in most applications of our work. We validate these claims empirically on four datasets using two base judges. Our experiments show that quantitative judges can effectively improve the predictive power of existing judges through post-hoc modeling.
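Below is a minimal sketch of the core idea as described in the abstract: a lightweight regression model trained post hoc to map a frozen base judge's textual evaluation and numeric score to human scores. The TF-IDF featurization, ridge regressor, and toy data are illustrative assumptions, not the paper's actual featurization, models, or datasets.

```python
# Sketch of a "quantitative judge": align an existing LLM judge's outputs to
# human scores with a post-hoc regression model, instead of fine-tuning the LLM.
# TF-IDF + Ridge are stand-ins for whatever features/regressor the paper uses.
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# Toy data standing in for collected base-judge outputs and limited human feedback.
judge_texts = [
    "The answer is accurate and well structured.",
    "The response misses the main question and is verbose.",
    "Mostly correct but omits one key detail.",
]
judge_scores = np.array([9.0, 3.0, 6.0])   # base judge's numeric scores
human_scores = np.array([8.0, 2.0, 7.0])   # ground-truth human ratings

# Featurize: embed the judge's textual evaluation and append its numeric score.
vectorizer = TfidfVectorizer()
text_features = vectorizer.fit_transform(judge_texts)
X = hstack([text_features, csr_matrix(judge_scores.reshape(-1, 1))])

# Train a lightweight regressor that refines the base judge toward human scores.
quant_judge = Ridge(alpha=1.0).fit(X, human_scores)

# At inference time, the same pipeline produces a refined score for new outputs.
new_text = ["Clear reasoning with a minor factual slip."]
new_score = np.array([[7.0]])
X_new = hstack([vectorizer.transform(new_text), csr_matrix(new_score)])
print(quant_judge.predict(X_new))
```

Because only a small regression head is trained while the base judge stays frozen, this kind of setup is cheaper than supervised fine-tuning and can work with the limited human feedback the abstract anticipates.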