WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

Abstract

We introduce WildBench, an automated evaluation framework designed to benchmark large language models (LLMs) using challenging, real-world user queries. WildBench consists of 1,024 tasks carefully selected from over one million human-chatbot conversation logs. For automated evaluation with WildBench, we have developed two metrics, WB-Reward and WB-Score, which are computable using advanced LLMs such as GPT-4-turbo. WildBench evaluation uses task-specific checklists to evaluate model outputs systematically and provides structured explanations that justify the scores and comparisons, resulting in more reliable and interpretable automatic judgments. WB-Reward employs fine-grained pairwise comparisons between model responses, generating five potential outcomes: much better, slightly better, slightly worse, much worse, or a tie. Unlike previous evaluations that employed a single baseline model, we selected three baseline models at varying performance levels to ensure a comprehensive pairwise evaluation. Additionally, we propose a simple method to mitigate length bias: an outcome of "slightly better/worse" is converted to "tie" if the winning response exceeds the losing one by more than K characters. WB-Score evaluates the quality of model outputs individually, making it a fast and cost-efficient evaluation metric. WildBench results demonstrate a strong correlation with the human-voted Elo ratings from Chatbot Arena on hard tasks. Specifically, WB-Reward achieves a Pearson correlation of 0.98 with top-ranking models. Additionally, WB-Score reaches 0.95, surpassing ArenaHard's 0.91 as well as AlpacaEval2.0's 0.89 for length-controlled win rates and 0.87 for regular win rates.
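The length-bias rule described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Outcome` and `apply_length_penalty` are hypothetical, the outcome is assumed to be stated from the perspective of response A, and the threshold value used in the usage lines (K = 500) is chosen only for illustration, since the abstract does not fix K.

```python
from enum import Enum


class Outcome(Enum):
    """The five pairwise outcomes named in the abstract, from A's perspective."""
    MUCH_BETTER = "much better"
    SLIGHTLY_BETTER = "slightly better"
    TIE = "tie"
    SLIGHTLY_WORSE = "slightly worse"
    MUCH_WORSE = "much worse"


def apply_length_penalty(outcome: Outcome, response_a: str, response_b: str, k: int) -> Outcome:
    """Convert a 'slightly better/worse' verdict to a tie when the winning
    response is more than k characters longer than the losing one.

    'Much better/worse' verdicts and ties are returned unchanged.
    """
    if outcome is Outcome.SLIGHTLY_BETTER:
        winner, loser = response_a, response_b
    elif outcome is Outcome.SLIGHTLY_WORSE:
        winner, loser = response_b, response_a
    else:
        return outcome

    # Length is measured in characters, as stated in the abstract.
    if len(winner) - len(loser) > k:
        return Outcome.TIE
    return outcome


# Usage with a hypothetical threshold of K = 500 characters.
long_answer = "x" * 600
short_answer = "x" * 50
print(apply_length_penalty(Outcome.SLIGHTLY_BETTER, long_answer, short_answer, k=500))
print(apply_length_penalty(Outcome.MUCH_BETTER, long_answer, short_answer, k=500))
```

In this sketch a narrow win by a much longer response is treated as a tie, while a decisive win is left alone, so K acts as a knob for how strongly verbosity is discounted.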

