Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
April 29, 2024
Authors: Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, Patrick Lewis
cs.AI
Abstract
As Large Language Models (LLMs) have become more advanced, they have outpaced
our abilities to accurately evaluate their quality. Not only is finding data to
adequately probe particular model properties difficult, but evaluating the
correctness of a model's freeform generation alone is a challenge. To address
this, many evaluations now rely on using LLMs themselves as judges to score the
quality of outputs from other LLMs. Evaluations most commonly use a single
large model like GPT4. While this method has grown in popularity, it is costly,
has been shown to introduce intramodel bias, and in this work, we find that
very large models are often unnecessary. We propose instead to evaluate models
using a Panel of LLM evaluators (PoLL). Across three distinct judge settings
and spanning six different datasets, we find that using a PoLL composed of a
larger number of smaller models outperforms a single large judge, exhibits less
intra-model bias due to its composition of disjoint model families, and does so
while being over seven times less expensive.
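The abstract does not spell out how the panel's individual judgments are combined into a single evaluation. Below is a minimal sketch of the general idea, assuming each judge model returns a discrete verdict that the panel pools either by majority vote or by averaging into a soft score. The `Judge` interface and the function names `poll_verdict` and `poll_score` are hypothetical illustrations, not the authors' implementation.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical judge interface: each judge wraps one evaluator model and maps
# (question, reference, candidate_answer) to a verdict like "correct"/"incorrect".
Judge = Callable[[str, str, str], str]

def poll_verdict(judges: List[Judge], question: str, reference: str, answer: str) -> str:
    """Pool independent verdicts from the panel by majority (max) vote."""
    votes = Counter(judge(question, reference, answer) for judge in judges)
    return votes.most_common(1)[0][0]

def poll_score(judges: List[Judge], question: str, reference: str, answer: str) -> float:
    """Alternative pooling: average the panel's binary verdicts into a soft score in [0, 1]."""
    verdicts = [judge(question, reference, answer) for judge in judges]
    return sum(v == "correct" for v in verdicts) / len(verdicts)
```

In practice each `Judge` would call a different, smaller model drawn from a disjoint model family, so that individual judges' intra-model biases partially cancel in the pooled result while the combined per-query cost stays well below a single frontier-model judge call.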