Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

April 29, 2024
作者: Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, Patrick Lewis
cs.AI

Abstract

As Large Language Models (LLMs) have become more advanced, they have outpaced our abilities to accurately evaluate their quality. Not only is finding data to adequately probe particular model properties difficult, but evaluating the correctness of a model's freeform generation alone is a challenge. To address this, many evaluations now rely on using LLMs themselves as judges to score the quality of outputs from other LLMs. Evaluations most commonly use a single large model like GPT4. While this method has grown in popularity, it is costly, has been shown to introduce intramodel bias, and in this work, we find that very large models are often unnecessary. We propose instead to evaluate models using a Panel of LLm evaluators (PoLL). Across three distinct judge settings and spanning six different datasets, we find that using a PoLL composed of a larger number of smaller models outperforms a single large judge, exhibits less intra-model bias due to its composition of disjoint model families, and does so while being over seven times less expensive.
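
The abstract describes pooling verdicts from a panel of several smaller judge models drawn from disjoint model families, rather than relying on a single large judge. The following is a minimal sketch of that idea, assuming a binary correct/incorrect judging task; the `query_judge` helper, the placeholder judge names, and majority voting as the pooling rule are illustrative assumptions, not the paper's exact configuration.

```python
from collections import Counter
from typing import Callable, List

def query_judge(judge: str, question: str, reference: str, answer: str) -> str:
    """Hypothetical stand-in for calling one judge model's API.
    Returns 'yes' if the judge deems `answer` correct, else 'no'."""
    raise NotImplementedError("replace with a call to the judge model's API")

def poll_verdict(
    judges: List[str],
    question: str,
    reference: str,
    answer: str,
    aggregate: Callable[[List[str]], str] = lambda votes: Counter(votes).most_common(1)[0][0],
) -> str:
    """Panel-of-LLM-evaluators sketch: collect one verdict per judge model
    and pool the votes (majority vote by default) into a single panel verdict."""
    votes = [query_judge(j, question, reference, answer) for j in judges]
    return aggregate(votes)

# Example usage with placeholder judge names (not the paper's exact judges):
# verdict = poll_verdict(["judge-a", "judge-b", "judge-c"], q, ref, model_answer)
```

Because each judge is a smaller model and the panel spans different model families, the pooled verdict can be both cheaper to obtain and less prone to favoring outputs from any single family, which is the effect the paper reports.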
