Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
April 29, 2024
Authors: Pat Verga, Sebastian Hofstätter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, Patrick Lewis
cs.AI
Abstract
As Large Language Models (LLMs) have become more advanced, they have outpaced
our abilities to accurately evaluate their quality. Not only is finding data to
adequately probe particular model properties difficult, but evaluating the
correctness of a model's freeform generation alone is a challenge. To address
this, many evaluations now rely on using LLMs themselves as judges to score the
quality of outputs from other LLMs. Evaluations most commonly use a single
large model like GPT4. While this method has grown in popularity, it is costly,
has been shown to introduce intramodel bias, and in this work, we find that
very large models are often unnecessary. We propose instead to evaluate models
using a Panel of LLM evaluators (PoLL). Across three distinct judge settings
and spanning six different datasets, we find that using a PoLL composed of a
larger number of smaller models outperforms a single large judge, exhibits less
intra-model bias due to its composition of disjoint model families, and does so
while being over seven times less expensive.
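The panel-of-judges idea in the abstract can be sketched as a simple vote-aggregation loop. This is a minimal illustration, not the paper's implementation: the `ask_judge` stub, the judge names, and the majority-vote aggregation are assumptions (the paper evaluates several pooling choices across its judge settings).

```python
from collections import Counter
from typing import Callable, Sequence

def ask_judge(judge: str, question: str, reference: str, answer: str) -> bool:
    """Hypothetical stub for a call to one judge LLM. In practice this would
    prompt the judge model and parse its correct/incorrect verdict."""
    raise NotImplementedError

def poll_verdict(
    judges: Sequence[str],
    question: str,
    reference: str,
    answer: str,
    judge_fn: Callable[[str, str, str, str], bool] = ask_judge,
) -> bool:
    """Aggregate per-judge verdicts from a panel of smaller models (PoLL)
    by majority vote, instead of trusting a single large judge."""
    votes = Counter(judge_fn(j, question, reference, answer) for j in judges)
    return votes[True] > votes[False]
```

Using disjoint model families on the panel (e.g. three judges from three different providers) is what the abstract credits with reducing intra-model bias, since no single family's idiosyncrasies dominate the vote.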