
Benchmark Agreement Testing Done Right: A Guide for LLM Benchmark Evaluation

July 18, 2024
Authors: Yotam Perlitz, Ariel Gera, Ofir Arviv, Asaf Yehudai, Elron Bandel, Eyal Shnarch, Michal Shmueli-Scheuer, Leshem Choshen
cs.AI

Abstract

Recent advancements in Language Models (LMs) have catalyzed the creation of multiple benchmarks, designed to assess these models' general capabilities. A crucial task, however, is assessing the validity of the benchmarks themselves. This is most commonly done via Benchmark Agreement Testing (BAT), where new benchmarks are validated against established ones using some agreement metric (e.g., rank correlation). Despite the crucial role of BAT for benchmark builders and consumers, there are no standardized procedures for such agreement testing. This deficiency can lead to invalid conclusions, fostering mistrust in benchmarks and undermining the ability to choose the appropriate benchmark to use. By analyzing over 40 prominent benchmarks, we demonstrate how some overlooked methodological choices can significantly influence BAT results, potentially undermining the validity of conclusions. To address these inconsistencies, we propose a set of best practices for BAT and demonstrate how utilizing these methodologies greatly improves BAT robustness and validity. To foster adoption and facilitate future research, we introduce BenchBench, a Python package for BAT, and release the BenchBench-leaderboard, a meta-benchmark designed to evaluate benchmarks using their peers. Our findings underscore the necessity for standardized BAT, ensuring the robustness and validity of benchmark evaluations in the evolving landscape of language model research. BenchBench Package: https://github.com/IBM/BenchBench Leaderboard: https://huggingface.co/spaces/per/BenchBench
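
To make the agreement-testing procedure concrete, below is a minimal sketch of BAT as the abstract describes it: two benchmarks are compared by how similarly they rank a shared set of models, using rank-correlation metrics such as Spearman and Kendall's tau. The model names and scores are hypothetical, and the snippet calls SciPy directly; it is not the BenchBench package API.

```python
# Minimal sketch of Benchmark Agreement Testing (BAT): measure how
# similarly two benchmarks rank the same set of models.
# Model names and scores below are hypothetical illustrations.
from scipy.stats import kendalltau, spearmanr

# Hypothetical aggregate scores for the same models on two benchmarks.
benchmark_a = {"model-1": 71.2, "model-2": 64.5, "model-3": 58.9, "model-4": 80.3}
benchmark_b = {"model-1": 0.68, "model-2": 0.61, "model-3": 0.63, "model-4": 0.74}

# Agreement is only meaningful over the models evaluated by both benchmarks.
shared_models = sorted(set(benchmark_a) & set(benchmark_b))
scores_a = [benchmark_a[m] for m in shared_models]
scores_b = [benchmark_b[m] for m in shared_models]

# Two common agreement metrics: Spearman rank correlation and Kendall's tau
# (pairwise ordering agreement). Both return (correlation, p-value).
spearman_corr, _ = spearmanr(scores_a, scores_b)
kendall_corr, _ = kendalltau(scores_a, scores_b)
print(f"Spearman: {spearman_corr:.2f}, Kendall tau: {kendall_corr:.2f}")
```

As the abstract notes, choices such as which models enter the comparison and which agreement metric is used can materially change the resulting score; standardizing these choices is the inconsistency that the proposed best practices and the BenchBench package aim to address.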
