
WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

June 7, 2024
Authors: Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, Yejin Choi
cs.AI

Abstract

We introduce WildBench, an automated evaluation framework designed to benchmark large language models (LLMs) using challenging, real-world user queries. WildBench consists of 1,024 tasks carefully selected from over one million human-chatbot conversation logs. For automated evaluation with WildBench, we have developed two metrics, WB-Reward and WB-Score, which are computable using advanced LLMs such as GPT-4-turbo. WildBench evaluation uses task-specific checklists to evaluate model outputs systematically and provides structured explanations that justify the scores and comparisons, resulting in more reliable and interpretable automatic judgments. WB-Reward employs fine-grained pairwise comparisons between model responses, generating five potential outcomes: much better, slightly better, slightly worse, much worse, or a tie. Unlike previous evaluations that employed a single baseline model, we selected three baseline models at varying performance levels to ensure a comprehensive pairwise evaluation. Additionally, we propose a simple method to mitigate length bias by converting outcomes of "slightly better/worse" to "tie" if the winning response exceeds the losing one by more than K characters. WB-Score evaluates the quality of model outputs individually, making it a fast and cost-efficient evaluation metric. WildBench results demonstrate a strong correlation with the human-voted Elo ratings from Chatbot Arena on hard tasks. Specifically, WB-Reward achieves a Pearson correlation of 0.98 with top-ranking models. Additionally, WB-Score reaches 0.95, surpassing both ArenaHard's 0.91 and AlpacaEval2.0's 0.89 for length-controlled win rates, as well as AlpacaEval2.0's 0.87 for regular win rates.
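To make the length-bias mitigation and the pairwise-reward aggregation concrete, here is a minimal Python sketch. It is not the authors' code: the numeric reward mapping (+1/±0.5/0), the default threshold of K = 500 characters, and the function names length_adjusted_outcome and wb_reward are illustrative assumptions; only the five outcome labels and the downgrade-to-tie rule come from the abstract.

```python
# Illustrative sketch (not the authors' implementation) of the WB-Reward
# length-bias rule: a "slightly better/worse" verdict is downgraded to "tie"
# when the winning response is more than K characters longer than the loser.
# The reward values and K = 500 default are assumptions for illustration.

REWARD = {
    "much better": 1.0,
    "slightly better": 0.5,
    "tie": 0.0,
    "slightly worse": -0.5,
    "much worse": -1.0,
}


def length_adjusted_outcome(outcome: str, model_response: str,
                            baseline_response: str, k: int = 500) -> str:
    """Downgrade a marginal win/loss to a tie if the winner is longer
    than the loser by more than k characters."""
    if outcome == "slightly better":
        winner, loser = model_response, baseline_response
    elif outcome == "slightly worse":
        winner, loser = baseline_response, model_response
    else:
        # "much better/worse" and "tie" are left unchanged.
        return outcome
    return "tie" if len(winner) - len(loser) > k else outcome


def wb_reward(judgments: list[tuple[str, str, str]], k: int = 500) -> float:
    """Average reward over (outcome, model_response, baseline_response)
    triples, e.g. pairwise judgments against the three baseline models."""
    adjusted = (length_adjusted_outcome(o, m, b, k) for o, m, b in judgments)
    rewards = [REWARD[a] for a in adjusted]
    return sum(rewards) / len(rewards)
```

For example, a judgment of "slightly better" where the model's response is 800 characters longer than the baseline's would count as a tie (reward 0.0) under this rule, removing the advantage of verbosity from marginal wins.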
