

From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

June 17, 2024
作者: Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, Ion Stoica
cs.AI

Abstract

The rapid evolution of language models has necessitated the development of more challenging benchmarks. Current static benchmarks often struggle to consistently distinguish between the capabilities of different models and fail to align with real-world user preferences. On the other hand, live crowd-sourced platforms like the Chatbot Arena collect a wide range of natural prompts and user feedback. However, these prompts vary in sophistication and the feedback cannot be applied offline to new models. In order to ensure that benchmarks keep up with the pace of LLM development, we address how one can evaluate benchmarks on their ability to confidently separate models and their alignment with human preference. Under these principles, we developed BenchBuilder, a living benchmark that filters high-quality prompts from live data sources to enable offline evaluation on fresh, challenging prompts. BenchBuilder identifies seven indicators of a high-quality prompt, such as the requirement for domain knowledge, and utilizes an LLM annotator to select a high-quality subset of prompts from various topic clusters. The LLM evaluation process employs an LLM judge to ensure a fully automated, high-quality, and constantly updating benchmark. We apply BenchBuilder on prompts from the Chatbot Arena to create Arena-Hard-Auto v0.1: 500 challenging user prompts from a wide range of tasks. Arena-Hard-Auto v0.1 offers 3x tighter confidence intervals than MT-Bench and achieves a state-of-the-art 89.1% agreement with human preference rankings, all at a cost of only $25 and without human labelers. The BenchBuilder pipeline enhances evaluation benchmarks and provides a valuable tool for developers, enabling them to extract high-quality benchmarks from extensive data with minimal effort.
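The filtering stage described above can be pictured as an LLM annotator scoring each crowdsourced prompt against seven quality indicators and keeping only prompts that satisfy most of them, balanced across topic clusters. The sketch below is a minimal illustration of that idea, not the authors' implementation: the specific criterion names (beyond the abstract's "domain knowledge"), the score threshold, the per-cluster cap, and the `annotate` callable are all assumptions introduced here for illustration.

```python
# Minimal sketch of a BenchBuilder-style prompt-filtering step.
# Criterion names, thresholds, and the per-cluster cap are illustrative
# assumptions; only "domain knowledge" is named in the abstract.

from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical set of seven quality indicators.
QUALITY_CRITERIA = [
    "specificity",
    "domain_knowledge",
    "complexity",
    "problem_solving",
    "creativity",
    "technical_accuracy",
    "real_world_application",
]


@dataclass
class ScoredPrompt:
    text: str
    cluster_id: int  # topic cluster the prompt was drawn from
    score: int       # number of criteria the LLM annotator marked as satisfied


def filter_prompts(
    prompts: List[dict],
    annotate: Callable[[str], List[bool]],  # LLM annotator: prompt -> one bool per criterion
    min_score: int = 6,
    per_cluster_cap: int = 20,
) -> List[ScoredPrompt]:
    """Keep prompts satisfying at least `min_score` of the seven criteria,
    while capping how many are taken from any single topic cluster."""
    kept: List[ScoredPrompt] = []
    taken_per_cluster: Dict[int, int] = {}
    for p in prompts:
        flags = annotate(p["text"])  # expected length: len(QUALITY_CRITERIA)
        score = sum(flags)
        cluster = p["cluster_id"]
        if score >= min_score and taken_per_cluster.get(cluster, 0) < per_cluster_cap:
            kept.append(ScoredPrompt(p["text"], cluster, score))
            taken_per_cluster[cluster] = taken_per_cluster.get(cluster, 0) + 1
    return kept
```

In practice `annotate` would wrap a call to an LLM with a rubric prompt listing the criteria; keeping the selection capped per cluster preserves topical diversity while the score threshold concentrates the benchmark on challenging prompts.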
