Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

Abstract

The LLM-as-a-judge paradigm offers a promising solution to the scalability challenges of human evaluation and is rapidly gaining traction as an approach to evaluating large language models (LLMs). However, many open questions remain about the strengths and weaknesses of this paradigm and the potential biases it may hold. In this paper, we present a comprehensive study of the performance of various LLMs acting as judges. We leverage TriviaQA as a benchmark for assessing objective knowledge reasoning of LLMs and evaluate them alongside human annotations, which we found to have high inter-annotator agreement. Our study includes 9 judge models and 9 exam taker models, both base and instruction-tuned. We assess the judge models' alignment across different model sizes, families, and judge prompts. Among other results, our research rediscovers the importance of using Cohen's kappa as a metric of alignment as opposed to simple percent agreement, showing that judges with high percent agreement can still assign vastly different scores. We find that both Llama-3 70B and GPT-4 Turbo show excellent alignment with humans, but in terms of ranking exam taker models, they are outperformed by both JudgeLM-7B and the lexical judge Contains, which have up to 34 points lower human alignment. Through error analysis and various other studies, including the effects of instruction length and leniency bias, we hope to provide valuable lessons for using LLMs as judges in the future.
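
To illustrate why percent agreement and Cohen's kappa can diverge, the following is a minimal sketch, not the paper's evaluation code. The label lists are hypothetical data invented for this example, and the function names `percent_agreement` and `cohens_kappa` are our own. It shows a judge that marks every answer "correct": it reaches 90% agreement with the human labels, yet its kappa is zero because that agreement is no better than chance given the label distribution.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which the two raters give the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label
    # if each labelled independently according to its own marginals.
    p_expected = sum(
        counts_a[label] * counts_b[label]
        for label in set(labels_a) | set(labels_b)
    ) / (n * n)
    if p_expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical data: humans mark 90 answers correct and 10 incorrect.
human = ["correct"] * 90 + ["incorrect"] * 10
# Hypothetical overly lenient judge: marks every answer correct.
lenient_judge = ["correct"] * 100

print(percent_agreement(human, lenient_judge))
print(cohens_kappa(human, lenient_judge))
```

In this toy setting the lenient judge never detects an incorrect answer, which percent agreement hides and kappa exposes.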
