Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

Abstract

As Large Language Models (LLMs) have become more advanced, they have outpaced our ability to evaluate their quality accurately. Not only is it difficult to find data that adequately probes particular model properties, but evaluating the correctness of a model's free-form generation is a challenge in itself. To address this, many evaluations now use LLMs themselves as judges to score the quality of outputs from other LLMs, most commonly a single large model such as GPT4. While this method has grown in popularity, it is costly and has been shown to introduce intra-model bias, and in this work we find that very large models are often unnecessary. We propose instead to evaluate models using a Panel of LLm evaluators (PoLL). Across three distinct judge settings and six different datasets, we find that a PoLL composed of a larger number of smaller models outperforms a single large judge, exhibits less intra-model bias because it is composed of disjoint model families, and does so while being over seven times less expensive.
