Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
March 7, 2024
Authors: Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, Ion Stoica
cs.AI
Abstract
Large Language Models (LLMs) have unlocked new capabilities and applications;
however, evaluating their alignment with human preferences still poses
significant challenges. To address this issue, we introduce Chatbot Arena, an
open platform for evaluating LLMs based on human preferences. Our methodology
employs a pairwise comparison approach and leverages input from a diverse user
base through crowdsourcing. The platform has been operational for several
months, amassing over 240K votes. This paper describes the platform, analyzes
the data we have collected so far, and explains the tried-and-true statistical
methods we are using for efficient and accurate evaluation and ranking of
models. We confirm that the crowdsourced questions are sufficiently diverse and
discriminating and that the crowdsourced human votes are in good agreement with
those of expert raters. These analyses collectively establish a robust
foundation for the credibility of Chatbot Arena. Because of its unique value
and openness, Chatbot Arena has emerged as one of the most referenced LLM
leaderboards, widely cited by leading LLM developers and companies. Our demo is
publicly available at https://chat.lmsys.org.
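The abstract describes the methodology only at a high level: pairwise votes are crowdsourced and then converted into a model ranking with statistical methods. As a rough, hypothetical sketch (not the authors' code), the snippet below fits a Bradley-Terry-style model to a handful of made-up pairwise votes via logistic regression and rescales the fitted strengths to an Elo-like scale; the vote records and model names are illustrative placeholders.

```python
# Minimal sketch: Bradley-Terry-style ratings from pairwise votes.
# Assumptions: toy (model_a, model_b, winner) records; ties omitted; not the paper's code.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical crowdsourced votes; winner is "a" or "b".
votes = [
    ("gpt-4", "llama-2-70b", "a"),
    ("gpt-4", "claude-2", "a"),
    ("claude-2", "llama-2-70b", "a"),
    ("llama-2-70b", "claude-2", "b"),
    ("claude-2", "gpt-4", "b"),
]

models = sorted({m for a, b, _ in votes for m in (a, b)})
index = {m: i for i, m in enumerate(models)}

# Design matrix: one row per vote, +1 in model_a's column, -1 in model_b's.
X = np.zeros((len(votes), len(models)))
y = np.zeros(len(votes))
for row, (a, b, winner) in enumerate(votes):
    X[row, index[a]] = 1.0
    X[row, index[b]] = -1.0
    y[row] = 1.0 if winner == "a" else 0.0

# Logistic regression without intercept: P(a wins) = sigmoid(beta_a - beta_b),
# i.e. the Bradley-Terry model. Default L2 regularization keeps the toy fit stable.
clf = LogisticRegression(fit_intercept=False)
clf.fit(X, y)

# Rescale log-strengths to an Elo-like scale (400 / ln(10) per logit, anchored at 1000).
scale = 400.0 / np.log(10.0)
ratings = {m: 1000.0 + scale * clf.coef_[0][index[m]] for m in models}
for m, r in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{m:>15}: {r:.0f}")
```

In this formulation each vote contributes one comparison row, so the fitted coefficients are per-model log-strengths whose differences predict win probabilities; confidence intervals for the resulting ranking can then be obtained, for example, by bootstrapping over the votes.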