Benchmark Agreement Testing Done Right: A Guide for LLM Benchmark Evaluation
July 18, 2024
Authors: Yotam Perlitz, Ariel Gera, Ofir Arviv, Asaf Yehudai, Elron Bandel, Eyal Shnarch, Michal Shmueli-Scheuer, Leshem Choshen
cs.AI
Abstract
Recent advancements in Language Models (LMs) have catalyzed the creation of
multiple benchmarks, designed to assess these models' general capabilities. A
crucial task, however, is assessing the validity of the benchmarks themselves.
This is most commonly done via Benchmark Agreement Testing (BAT), where new
benchmarks are validated against established ones using some agreement metric
(e.g., rank correlation). Despite the crucial role of BAT for benchmark
builders and consumers, there are no standardized procedures for such agreement
testing. This deficiency can lead to invalid conclusions, fostering mistrust in
benchmarks and upending the ability to properly choose the appropriate
benchmark to use. By analyzing over 40 prominent benchmarks, we demonstrate how
some overlooked methodological choices can significantly influence BAT results,
potentially undermining the validity of conclusions. To address these
inconsistencies, we propose a set of best practices for BAT and demonstrate how
utilizing these methodologies greatly improves BAT robustness and validity. To
foster adoption and facilitate future research, we introduce BenchBench, a
Python package for BAT, and release the BenchBench-leaderboard, a
meta-benchmark designed to evaluate benchmarks using their peers. Our findings
underscore the necessity for standardized BAT, ensuring the robustness and
validity of benchmark evaluations in the evolving landscape of language model
research.
BenchBench Package: https://github.com/IBM/BenchBench
Leaderboard: https://huggingface.co/spaces/per/BenchBench
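
To make the agreement-testing setup concrete, the sketch below illustrates the core of BAT with rank correlation: take the models evaluated by both a new benchmark and an established reference benchmark, and correlate the rankings their scores induce. The model names and scores are hypothetical placeholders, and the snippet calls SciPy directly; it is an illustration of the general idea, not the BenchBench package API.

```python
# Minimal Benchmark Agreement Testing (BAT) sketch: rank correlation between the
# model rankings induced by two benchmarks. Model names and scores are hypothetical.
from scipy.stats import kendalltau, spearmanr

new_benchmark = {"model-a": 71.2, "model-b": 64.5, "model-c": 80.3, "model-d": 58.9}
reference_benchmark = {"model-a": 0.68, "model-b": 0.61, "model-c": 0.79, "model-d": 0.55}

# Agreement is only meaningful over the models evaluated by both benchmarks.
models = sorted(set(new_benchmark) & set(reference_benchmark))
x = [new_benchmark[m] for m in models]
y = [reference_benchmark[m] for m in models]

# Rank-based agreement metrics (e.g., Kendall's tau, Spearman's rho).
tau, tau_p = kendalltau(x, y)
rho, rho_p = spearmanr(x, y)
print(f"Kendall tau = {tau:.2f} (p = {tau_p:.2f}); Spearman rho = {rho:.2f} (p = {rho_p:.2f})")
```

As the abstract notes, numbers like these can shift substantially with seemingly minor methodological choices, such as which reference benchmark is chosen and which subset of models the correlation is computed over; standardizing those choices is what the proposed best practices and the BenchBench package are meant to address.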