
WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

June 7, 2024
Authors: Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, Yejin Choi
cs.AI

Abstract

We introduce WildBench, an automated evaluation framework designed to benchmark large language models (LLMs) using challenging, real-world user queries. WildBench consists of 1,024 tasks carefully selected from over one million human-chatbot conversation logs. For automated evaluation with WildBench, we have developed two metrics, WB-Reward and WB-Score, which are computable using advanced LLMs such as GPT-4-turbo. WildBench evaluation uses task-specific checklists to evaluate model outputs systematically and provides structured explanations that justify the scores and comparisons, resulting in more reliable and interpretable automatic judgments. WB-Reward employs fine-grained pairwise comparisons between model responses, generating five potential outcomes: much better, slightly better, slightly worse, much worse, or a tie. Unlike previous evaluations that employed a single baseline model, we selected three baseline models at varying performance levels to ensure a comprehensive pairwise evaluation. Additionally, we propose a simple method to mitigate length bias by converting outcomes of "slightly better/worse" to "tie" if the winning response exceeds the losing one by more than K characters. WB-Score evaluates the quality of model outputs individually, making it a fast and cost-efficient evaluation metric. WildBench results demonstrate a strong correlation with the human-voted Elo ratings from Chatbot Arena on hard tasks. Specifically, WB-Reward achieves a Pearson correlation of 0.98 with top-ranking models. Additionally, WB-Score reaches 0.95, surpassing both ArenaHard's 0.91 and AlpacaEval2.0's 0.89 for length-controlled win rates, as well as AlpacaEval2.0's 0.87 for regular win rates.
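To make the length-bias mitigation and the pairwise-reward aggregation concrete, here is a minimal Python sketch. It is not the authors' code: the numeric reward mapping (+1/±0.5/0), the default threshold of K = 500 characters, and the function names length_adjusted_outcome and wb_reward are illustrative assumptions; only the five outcome labels and the downgrade-to-tie rule come from the abstract.

```python
# Illustrative sketch (not the authors' implementation) of the WB-Reward
# length-bias rule: a "slightly better/worse" verdict is downgraded to "tie"
# when the winning response is more than K characters longer than the loser.
# The reward values and K = 500 default are assumptions for illustration.

REWARD = {
    "much better": 1.0,
    "slightly better": 0.5,
    "tie": 0.0,
    "slightly worse": -0.5,
    "much worse": -1.0,
}


def length_adjusted_outcome(outcome: str, model_response: str,
                            baseline_response: str, k: int = 500) -> str:
    """Downgrade a marginal win/loss to a tie if the winner is longer
    than the loser by more than k characters."""
    if outcome == "slightly better":
        winner, loser = model_response, baseline_response
    elif outcome == "slightly worse":
        winner, loser = baseline_response, model_response
    else:
        # "much better/worse" and "tie" are left unchanged.
        return outcome
    return "tie" if len(winner) - len(loser) > k else outcome


def wb_reward(judgments: list[tuple[str, str, str]], k: int = 500) -> float:
    """Average reward over (outcome, model_response, baseline_response)
    triples, e.g. pairwise judgments against the three baseline models."""
    adjusted = (length_adjusted_outcome(o, m, b, k) for o, m, b in judgments)
    rewards = [REWARD[a] for a in adjusted]
    return sum(rewards) / len(rewards)
```

For example, a judgment of "slightly better" where the model's response is 800 characters longer than the baseline's would count as a tie (reward 0.0) under this rule, removing the advantage of verbosity from marginal wins.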
