BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation

Abstract

As large language models (LLMs) continue to advance, the need for up-to-date and well-organized benchmarks becomes increasingly critical. However, many existing datasets are scattered and difficult to manage, which makes it challenging to perform evaluations tailored to specific needs or domains, despite the growing importance of domain-specific models in areas such as math and code. In this paper, we introduce BenchHub, a dynamic benchmark repository that empowers researchers and developers to evaluate LLMs more effectively. BenchHub aggregates and automatically classifies benchmark datasets from diverse domains, integrating 303K questions across 38 benchmarks. It is designed to support continuous updates and scalable data management, enabling flexible and customizable evaluation tailored to various domains or use cases. Through extensive experiments with various LLM families, we demonstrate that model performance varies significantly across domain-specific subsets, emphasizing the importance of domain-aware benchmarking. We believe BenchHub can encourage better dataset reuse, more transparent model comparisons, and easier identification of underrepresented areas in existing benchmarks, offering a critical infrastructure for advancing LLM evaluation research.
