Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

Abstract

Large Language Models (LLMs) have unlocked new capabilities and applications; however, evaluating their alignment with human preferences still poses significant challenges. To address this issue, we introduce Chatbot Arena, an open platform for evaluating LLMs based on human preferences. Our methodology employs a pairwise comparison approach and leverages input from a diverse user base through crowdsourcing. The platform has been operational for several months and has amassed over 240K votes. This paper describes the platform, analyzes the data we have collected so far, and explains the well-established statistical methods we use to evaluate and rank models efficiently and accurately. We confirm that the crowdsourced questions are sufficiently diverse and discriminating, and that the crowdsourced human votes are in good agreement with those of expert raters. Together, these analyses establish a robust foundation for the credibility of Chatbot Arena. Because of its unique value and openness, Chatbot Arena has emerged as one of the most referenced LLM leaderboards, widely cited by leading LLM developers and companies. Our demo is publicly available at https://chat.lmsys.org.
