WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
June 7, 2024
作者: Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, Yejin Choi
cs.AI
Abstract
We introduce WildBench, an automated evaluation framework designed to
benchmark large language models (LLMs) using challenging, real-world user
queries. WildBench consists of 1,024 tasks carefully selected from over one
million human-chatbot conversation logs. For automated evaluation with
WildBench, we have developed two metrics, WB-Reward and WB-Score, which are
computable using advanced LLMs such as GPT-4-turbo. WildBench evaluation uses
task-specific checklists to evaluate model outputs systematically and provides
structured explanations that justify the scores and comparisons, resulting in
more reliable and interpretable automatic judgments. WB-Reward employs
fine-grained pairwise comparisons between model responses, generating five
potential outcomes: much better, slightly better, slightly worse, much worse,
or a tie. Unlike previous evaluations that employed a single baseline model, we
selected three baseline models at varying performance levels to ensure a
comprehensive pairwise evaluation. Additionally, we propose a simple method to
mitigate length bias by converting outcomes of "slightly better/worse" to
"tie" if the winner's response exceeds the loser's by more than K
characters. WB-Score evaluates the quality of model outputs individually,
making it a fast and cost-efficient evaluation metric. WildBench results
demonstrate a strong correlation with the human-voted Elo ratings from Chatbot
Arena on hard tasks. Specifically, WB-Reward achieves a Pearson correlation of
0.98 with top-ranking models. Additionally, WB-Score reaches 0.95, surpassing
both ArenaHard's 0.91 and AlpacaEval2.0's 0.89 for length-controlled win rates,
as well as 0.87 for regular win rates.
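The pairwise scoring rule and the length-bias mitigation described above can be summarized in a short sketch. This is a minimal illustration rather than the paper's reference implementation: the numeric reward values (+1/+0.5/0/-0.5/-1), the default threshold K, and all function names here are assumptions made for clarity.

```python
# Minimal sketch of WB-Reward aggregation with length-bias mitigation.
# Reward values, K, and function names are illustrative assumptions.

OUTCOME_REWARD = {
    "much better": 1.0,
    "slightly better": 0.5,
    "tie": 0.0,
    "slightly worse": -0.5,
    "much worse": -1.0,
}

def length_adjusted_outcome(outcome: str,
                            model_response: str,
                            baseline_response: str,
                            K: int = 500) -> str:
    """Demote a marginal win to a tie when the winner is much longer.

    If the response judged "slightly better" is more than K characters
    longer than the losing response, the outcome becomes a tie, which
    reduces the judge's preference for longer answers.
    """
    len_diff = len(model_response) - len(baseline_response)
    if outcome == "slightly better" and len_diff > K:
        return "tie"
    if outcome == "slightly worse" and -len_diff > K:
        return "tie"
    return outcome

def wb_reward(pairwise_results):
    """Average reward over (outcome, model_response, baseline_response) triples,
    typically collected against three baselines of different strength."""
    rewards = [
        OUTCOME_REWARD[length_adjusted_outcome(o, m, b)]
        for o, m, b in pairwise_results
    ]
    return sum(rewards) / len(rewards)
```

As the abstract notes, these per-comparison rewards are averaged over judgments against three baseline models of varying strength to obtain the final WB-Reward.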
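The reported agreement with Chatbot Arena can be checked for any leaderboard by correlating per-model benchmark scores with human-voted Elo ratings. The sketch below uses `scipy.stats.pearsonr` with placeholder numbers, not the paper's data.

```python
# Hypothetical example of computing the Pearson correlation between
# a benchmark score and Chatbot Arena Elo; all values are placeholders.
from scipy.stats import pearsonr

arena_elo = [1250, 1190, 1160, 1120, 1080]   # human-voted Elo per model (placeholder)
wb_score  = [82.1, 77.4, 74.0, 69.5, 63.2]   # WB-Score per model (placeholder)

r, p_value = pearsonr(arena_elo, wb_score)
print(f"Pearson r = {r:.2f} (p = {p_value:.3g})")
```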