

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

March 7, 2024
作者: Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, Ion Stoica
cs.AI

Abstract

Large Language Models (LLMs) have unlocked new capabilities and applications; however, evaluating the alignment with human preferences still poses significant challenges. To address this issue, we introduce Chatbot Arena, an open platform for evaluating LLMs based on human preferences. Our methodology employs a pairwise comparison approach and leverages input from a diverse user base through crowdsourcing. The platform has been operational for several months, amassing over 240K votes. This paper describes the platform, analyzes the data we have collected so far, and explains the tried-and-true statistical methods we are using for efficient and accurate evaluation and ranking of models. We confirm that the crowdsourced questions are sufficiently diverse and discriminating and that the crowdsourced human votes are in good agreement with those of expert raters. These analyses collectively establish a robust foundation for the credibility of Chatbot Arena. Because of its unique value and openness, Chatbot Arena has emerged as one of the most referenced LLM leaderboards, widely cited by leading LLM developers and companies. Our demo is publicly available at https://chat.lmsys.org.
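The abstract refers to the statistical methods used to rank models from pairwise votes without naming them here; the Chatbot Arena leaderboard's published methodology is built around the Bradley-Terry model fit to pairwise comparison outcomes. As a rough illustration only, not the authors' implementation, the sketch below fits Bradley-Terry strengths with the standard minorization-maximization updates; the vote data, the omission of ties, and the Elo-style rescaling are hypothetical simplifications.

```python
# Minimal Bradley-Terry fit over hypothetical pairwise votes.
# Each vote is (winner, loser); ties are ignored in this sketch.
import numpy as np

def bradley_terry(votes, models, iters=200):
    """Estimate Bradley-Terry strengths from pairwise votes via MM updates."""
    idx = {m: i for i, m in enumerate(models)}
    n = len(models)
    wins = np.zeros(n)        # total wins per model
    pairs = np.zeros((n, n))  # number of comparisons between each pair
    for winner, loser in votes:
        i, j = idx[winner], idx[loser]
        wins[i] += 1
        pairs[i, j] += 1
        pairs[j, i] += 1
    p = np.ones(n) / n
    for _ in range(iters):
        # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
        denom = (pairs / (p[:, None] + p[None, :] + 1e-12)).sum(axis=1)
        p = wins / np.maximum(denom, 1e-12)
        p /= p.sum()          # fix the scale; only ratios are identified
    # Map strengths onto an Elo-like scale purely for readability.
    ratings = 400 * np.log10(p / p.mean()) + 1000
    return dict(zip(models, ratings))

# Hypothetical votes between three placeholder models.
votes = [("model_a", "model_b"), ("model_a", "model_c"),
         ("model_b", "model_c"), ("model_c", "model_b"),
         ("model_b", "model_a")]
print(bradley_terry(votes, ["model_a", "model_b", "model_c"]))
```

In practice the leaderboard works with hundreds of thousands of votes across many model pairs and reports confidence intervals on the resulting rankings; the point of the sketch is only how pairwise human preferences translate into a single score per model.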