WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

Abstract

We introduce WildBench, an automated evaluation framework designed to benchmark large language models (LLMs) using challenging, real-world user queries. WildBench consists of 1,024 tasks carefully selected from over one million human-chatbot conversation logs. For automated evaluation with WildBench, we have developed two metrics, WB-Reward and WB-Score, which are computable using advanced LLMs such as GPT-4-turbo. WildBench evaluation uses task-specific checklists to evaluate model outputs systematically and provides structured explanations that justify the scores and comparisons, resulting in more reliable and interpretable automatic judgments. WB-Reward employs fine-grained pairwise comparisons between model responses, generating five potential outcomes: much better, slightly better, slightly worse, much worse, or a tie. Unlike previous evaluations that employed a single baseline model, we selected three baseline models at varying performance levels to ensure a comprehensive pairwise evaluation. Additionally, we propose a simple method to mitigate length bias: an outcome of "slightly better" or "slightly worse" is converted to "tie" if the winning response is longer than the losing one by more than K characters. WB-Score evaluates the quality of model outputs individually, making it a fast and cost-efficient evaluation metric. WildBench results demonstrate a strong correlation with the human-voted Elo ratings from Chatbot Arena on hard tasks. Specifically, WB-Reward achieves a Pearson correlation of 0.98 with top-ranking models. Additionally, WB-Score reaches 0.95, surpassing both ArenaHard's 0.91 and AlpacaEval2.0's 0.89 for length-controlled win rates, as well as the 0.87 for regular win rates.
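
The length-bias rule described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Outcome` and `apply_length_penalty`, the convention that the outcome is stated from the point of view of response A, and the use of raw character counts as arguments are all assumptions made for illustration. The abstract does not state a value for K, so it is left as a parameter.

```python
from enum import Enum


class Outcome(Enum):
    """The five pairwise outcomes named in the abstract (from A's point of view)."""
    MUCH_BETTER = "much better"
    SLIGHTLY_BETTER = "slightly better"
    TIE = "tie"
    SLIGHTLY_WORSE = "slightly worse"
    MUCH_WORSE = "much worse"


def apply_length_penalty(outcome: Outcome, len_a: int, len_b: int, k: int) -> Outcome:
    """Convert a narrow win into a tie when the winner is much longer.

    Hypothetical helper: `len_a` and `len_b` are the character counts of
    responses A and B, and `k` is the length threshold K from the abstract.
    """
    # A wins narrowly, but A is more than k characters longer than B.
    if outcome is Outcome.SLIGHTLY_BETTER and len_a - len_b > k:
        return Outcome.TIE
    # B wins narrowly, but B is more than k characters longer than A.
    if outcome is Outcome.SLIGHTLY_WORSE and len_b - len_a > k:
        return Outcome.TIE
    # "Much better/worse" and existing ties are left unchanged.
    return outcome
```

In this sketch a smaller `k` converts more narrow wins into ties and so penalizes long responses more strongly, while a very large `k` effectively disables the adjustment.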
