Benchmark Agreement Testing Done Right: A Guide for LLM Benchmark Evaluation

Abstract

Recent advancements in Language Models (LMs) have catalyzed the creation of multiple benchmarks designed to assess these models' general capabilities. A crucial task, however, is assessing the validity of the benchmarks themselves. This is most commonly done via Benchmark Agreement Testing (BAT), where new benchmarks are validated against established ones using some agreement metric (e.g., rank correlation). Despite the crucial role of BAT for benchmark builders and consumers, there are no standardized procedures for such agreement testing. This deficiency can lead to invalid conclusions, fostering mistrust in benchmarks and undermining the ability to choose the appropriate benchmark to use. By analyzing over 40 prominent benchmarks, we demonstrate how some overlooked methodological choices can significantly influence BAT results, potentially undermining the validity of conclusions. To address these inconsistencies, we propose a set of best practices for BAT and demonstrate how utilizing these methodologies greatly improves BAT robustness and validity. To foster adoption and facilitate future research, we introduce BenchBench, a Python package for BAT, and release the BenchBench-leaderboard, a meta-benchmark designed to evaluate benchmarks using their peers. Our findings underscore the necessity for standardized BAT, ensuring the robustness and validity of benchmark evaluations in the evolving landscape of language model research.

BenchBench package: https://github.com/IBM/BenchBench
Leaderboard: https://huggingface.co/spaces/per/BenchBench
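To make the idea of an agreement metric concrete, the following is a minimal sketch of a rank-correlation check between two benchmarks, using Kendall's tau computed over the models they have in common. It is not the BenchBench API; the function name `kendall_tau`, the model names, and all scores are hypothetical and serve only as an illustration.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two benchmarks, over the models both have scored.

    Each argument maps a model name to that model's score on one benchmark.
    Tied pairs count as neither concordant nor discordant.
    """
    models = sorted(set(scores_a) & set(scores_b))
    if len(models) < 2:
        raise ValueError("need at least two shared models")

    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        # Positive product: both benchmarks order this pair the same way.
        product = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores, for illustration only.
new_benchmark = {"model_a": 71.2, "model_b": 65.0, "model_c": 58.3, "model_d": 40.1}
reference_benchmark = {"model_a": 0.82, "model_b": 0.79, "model_c": 0.80, "model_d": 0.55}

tau = kendall_tau(new_benchmark, reference_benchmark)
print(f"Kendall tau: {tau:.3f}")
```

In this toy example the two benchmarks disagree on only one of the six model pairs, so the agreement is high but not perfect. The result depends on which models and which reference benchmark are included, which is the kind of methodological choice the abstract says can change BAT conclusions.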
