

Quantitative LLM Judges

June 3, 2025
作者: Aishwarya Sahoo, Jeevana Kruthi Karnuthala, Tushar Parmanand Budhwani, Pranchal Agarwal, Sankaran Vaidyanathan, Alexa Siu, Franck Dernoncourt, Jennifer Healey, Nedim Lipka, Ryan Rossi, Uttaran Bhattacharya, Branislav Kveton
cs.AI

Abstract

LLM-as-a-judge is a framework in which a large language model (LLM) automatically evaluates the output of another LLM. We propose quantitative LLM judges, which align evaluation scores of existing LLM judges to human scores in a given domain using regression models. The models are trained to improve the score of the original judge by using the judge's textual evaluation and score. We present four quantitative judges for different types of absolute and relative feedback, which showcases the generality and versatility of our framework. Our framework is more computationally efficient than supervised fine-tuning and can be more statistically efficient when human feedback is limited, which is expected in most applications of our work. We validate these claims empirically on four datasets using two base judges. Our experiments show that quantitative judges can effectively improve the predictive power of existing judges through post-hoc modeling.
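The abstract describes training a lightweight regression head on a frozen base judge's textual evaluation and numeric score to better match human scores. Below is a minimal illustrative sketch of that idea; the feature encoder (sentence-transformers), the ridge regressor, and all data shown are assumptions for illustration, not the paper's actual models or results.

```python
# Sketch of a "quantitative judge": a post-hoc regression that maps an existing
# LLM judge's textual evaluation and score to a human-aligned score.
import numpy as np
from sklearn.linear_model import Ridge
from sentence_transformers import SentenceTransformer

# Hypothetical training data: the frozen base judge's outputs plus limited human labels.
judge_texts = [
    "The answer is accurate and well structured.",
    "The response misses several key steps.",
]
judge_scores = np.array([4.0, 2.0])   # base judge's numeric scores
human_scores = np.array([4.5, 1.5])   # domain-specific human feedback

# Encode the judge's textual evaluations and append the judge's score as a feature.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(judge_texts)
features = np.hstack([embeddings, judge_scores.reshape(-1, 1)])

# Fit a lightweight regression head; only this head is trained, the judge LLM stays frozen,
# which is why the approach is cheaper than supervised fine-tuning of the judge itself.
quantitative_judge = Ridge(alpha=1.0).fit(features, human_scores)

# At inference time, re-score a new evaluation produced by the base judge.
new_features = np.hstack([
    encoder.encode(["Mostly correct but vague in places."]),
    np.array([[3.0]]),
])
print(quantitative_judge.predict(new_features))
```

The same pattern extends to relative (pairwise) feedback by swapping the regression target and head, which is the kind of variation the four proposed judges cover.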