Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

Abstract

As Large Language Models (LLMs) have become more advanced, they have outpaced our ability to accurately evaluate their quality. Not only is it difficult to find data that adequately probes particular model properties, but evaluating the correctness of a model's free-form generation is a challenge in its own right. To address this, many evaluations now use LLMs themselves as judges to score the quality of outputs from other LLMs, most commonly a single large model such as GPT4. While this method has grown in popularity, it is costly and has been shown to introduce intra-model bias, and in this work we find that very large models are often unnecessary. We propose instead to evaluate models using a Panel of LLM evaluators (PoLL). Across three distinct judge settings and six different datasets, we find that a PoLL composed of a larger number of smaller models outperforms a single large judge, exhibits less intra-model bias because it is composed of disjoint model families, and does so while being over seven times less expensive.
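To make the panel idea concrete, the following is a minimal sketch of how verdicts from several judges might be pooled into one decision. It rests on assumptions: the abstract does not state how PoLL combines its judges, so simple majority voting is used here purely for illustration, and the three judge functions are hypothetical string-matching stand-ins for calls to LLMs from different model families.

```python
import string
from typing import Callable, Sequence

# A judge maps (question, answer, reference) to a correct/incorrect verdict.
# In a real panel each judge would wrap a call to a different LLM; that part
# is deliberately left out of this sketch.
Judge = Callable[[str, str, str], bool]


def _tokens(text: str) -> set:
    """Lowercase, strip ASCII punctuation, and split into a set of tokens."""
    table = str.maketrans("", "", string.punctuation)
    return set(text.lower().translate(table).split())


# Hypothetical stand-in judges (assumptions, not the judges from the paper).
def exact_match_judge(question: str, answer: str, reference: str) -> bool:
    return answer.strip() == reference.strip()


def substring_judge(question: str, answer: str, reference: str) -> bool:
    return reference.lower() in answer.lower()


def token_judge(question: str, answer: str, reference: str) -> bool:
    return _tokens(reference) <= _tokens(answer)


def poll_verdict(
    judges: Sequence[Judge], question: str, answer: str, reference: str
) -> bool:
    """Pool the panel by strict majority vote (assumed aggregation rule)."""
    votes = [judge(question, answer, reference) for judge in judges]
    return 2 * sum(votes) > len(votes)


PANEL = [exact_match_judge, substring_judge, token_judge]

if __name__ == "__main__":
    print(poll_verdict(PANEL, "Capital of France?", "It is Paris.", "Paris"))
    print(poll_verdict(PANEL, "Capital of France?", "Lyon", "Paris"))
```

An odd number of judges avoids ties under majority voting. For graded rather than binary judgments, averaging the scores would be a natural alternative, again as an assumption rather than a description of the paper's method.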
