Benchmark Agreement Testing Done Right: A Guide for LLM Benchmark Evaluation

Abstract

Recent advancements in Language Models (LMs) have catalyzed the creation of multiple benchmarks designed to assess these models' general capabilities. A crucial task, however, is assessing the validity of the benchmarks themselves. This is most commonly done via Benchmark Agreement Testing (BAT), where new benchmarks are validated against established ones using some agreement metric (e.g., rank correlation). Despite the crucial role of BAT for benchmark builders and consumers, there are no standardized procedures for such agreement testing. This deficiency can lead to invalid conclusions, fostering mistrust in benchmarks and impairing the ability to choose the appropriate benchmark to use. By analyzing over 40 prominent benchmarks, we demonstrate how some overlooked methodological choices can significantly influence BAT results, potentially undermining the validity of conclusions. To address these inconsistencies, we propose a set of best practices for BAT and demonstrate how utilizing these methodologies greatly improves BAT robustness and validity. To foster adoption and facilitate future research, we introduce BenchBench, a Python package for BAT, and release the BenchBench-leaderboard, a meta-benchmark designed to evaluate benchmarks using their peers. Our findings underscore the necessity for standardized BAT, ensuring the robustness and validity of benchmark evaluations in the evolving landscape of language model research.

BenchBench Package: https://github.com/IBM/BenchBench
Leaderboard: https://huggingface.co/spaces/per/BenchBench
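To make the notion of an agreement metric concrete, the following is a minimal sketch of a rank-correlation check between a new benchmark and an established one. It is not the BenchBench API: the function `kendall_tau`, the model names, and all scores are hypothetical and chosen only for illustration. It computes Kendall's tau-a (one possible rank correlation) over the models that both benchmarks have scored.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two benchmarks over their shared models.

    Hypothetical helper for illustration; not part of BenchBench.
    Each argument maps a model name to that model's benchmark score.
    """
    models = sorted(set(scores_a) & set(scores_b))
    if len(models) < 2:
        raise ValueError("need at least two shared models")

    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        # A positive product means both benchmarks order this pair the same way.
        product = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # Ties contribute to neither count (tau-a convention).

    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Invented scores, for illustration only.
new_benchmark = {"model_a": 0.81, "model_b": 0.74, "model_c": 0.62, "model_d": 0.55}
reference_benchmark = {"model_a": 0.90, "model_b": 0.70, "model_c": 0.75, "model_d": 0.40}

tau = kendall_tau(new_benchmark, reference_benchmark)
print(f"Kendall tau: {tau:.3f}")
```

A value near 1 means the two benchmarks rank the shared models almost identically, while a value near 0 means the rankings are largely unrelated. The number obtained depends on choices such as which models and which reference benchmark are included, which is the kind of methodological decision a standardized BAT procedure is meant to pin down.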

