Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
April 29, 2024
Authors: Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, Patrick Lewis
cs.AI
Abstract
As Large Language Models (LLMs) have become more advanced, they have outpaced
our abilities to accurately evaluate their quality. Not only is finding data to
adequately probe particular model properties difficult, but evaluating the
correctness of a model's freeform generation alone is a challenge. To address
this, many evaluations now rely on using LLMs themselves as judges to score the
quality of outputs from other LLMs. Evaluations most commonly use a single
large model like GPT4. While this method has grown in popularity, it is costly,
has been shown to introduce intramodel bias, and in this work, we find that
very large models are often unnecessary. We propose instead to evaluate models
using a Panel of LLM evaluators (PoLL). Across three distinct judge settings
and spanning six different datasets, we find that using a PoLL composed of a
larger number of smaller models outperforms a single large judge, exhibits less
intra-model bias due to its composition of disjoint model families, and does so
while being over seven times less expensive.
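The abstract does not spell out how the panel's individual judgments are combined into a single evaluation. Below is a minimal sketch of the general idea, assuming each judge model returns a discrete verdict that the panel pools either by majority vote or by averaging into a soft score. The `Judge` interface and the function names `poll_verdict` and `poll_score` are hypothetical illustrations, not the authors' implementation.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical judge interface: each judge wraps one evaluator model and maps
# (question, reference, candidate_answer) to a verdict like "correct"/"incorrect".
Judge = Callable[[str, str, str], str]

def poll_verdict(judges: List[Judge], question: str, reference: str, answer: str) -> str:
    """Pool independent verdicts from the panel by majority (max) vote."""
    votes = Counter(judge(question, reference, answer) for judge in judges)
    return votes.most_common(1)[0][0]

def poll_score(judges: List[Judge], question: str, reference: str, answer: str) -> float:
    """Alternative pooling: average the panel's binary verdicts into a soft score in [0, 1]."""
    verdicts = [judge(question, reference, answer) for judge in judges]
    return sum(v == "correct" for v in verdicts) / len(verdicts)
```

In practice each `Judge` would call a different, smaller model drawn from a disjoint model family, so that individual judges' intra-model biases partially cancel in the pooled result while the combined per-query cost stays well below a single frontier-model judge call.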