

BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation

May 31, 2025
Authors: Eunsu Kim, Haneul Yoo, Guijin Son, Hitesh Patel, Amit Agarwal, Alice Oh
cs.AI

Abstract

As large language models (LLMs) continue to advance, the need for up-to-date and well-organized benchmarks becomes increasingly critical. However, despite the growing importance of domain-specific models in areas such as math or code, many existing datasets remain scattered and difficult to manage, making it challenging to perform evaluations tailored to specific needs or domains. In this paper, we introduce BenchHub, a dynamic benchmark repository that empowers researchers and developers to evaluate LLMs more effectively. BenchHub aggregates and automatically classifies benchmark datasets from diverse domains, integrating 303K questions across 38 benchmarks. It is designed to support continuous updates and scalable data management, enabling flexible and customizable evaluation tailored to various domains or use cases. Through extensive experiments with various LLM families, we demonstrate that model performance varies significantly across domain-specific subsets, emphasizing the importance of domain-aware benchmarking. We believe BenchHub can encourage better dataset reuse, more transparent model comparisons, and easier identification of underrepresented areas in existing benchmarks, offering a critical infrastructure for advancing LLM evaluation research.
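The abstract does not describe BenchHub's programming interface, but the domain-aware evaluation it motivates can be illustrated with a minimal sketch. The `BenchItem` fields, domain labels, and `evaluate_by_domain` helper below are hypothetical assumptions introduced for illustration, not BenchHub's actual API; the point is simply that a unified, domain-tagged question pool makes per-domain scoring a straightforward filter-and-aggregate step.

```python
# Hypothetical sketch: domain-filtered evaluation over a BenchHub-style question pool.
# All names and fields here are illustrative assumptions, not BenchHub's real API.
from dataclasses import dataclass
from collections import defaultdict
from typing import Callable, Dict, Iterable, Optional, Set


@dataclass
class BenchItem:
    benchmark: str  # source benchmark name, e.g. "MMLU"
    domain: str     # automatically assigned domain label, e.g. "math", "code"
    question: str
    answer: str


def evaluate_by_domain(
    items: Iterable[BenchItem],
    model: Callable[[str], str],
    domains: Optional[Set[str]] = None,
) -> Dict[str, float]:
    """Return per-domain accuracy, optionally restricted to a chosen domain subset."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for item in items:
        if domains is not None and item.domain not in domains:
            continue  # customizable evaluation: skip domains outside the requested subset
        total[item.domain] += 1
        if model(item.question).strip() == item.answer.strip():
            correct[item.domain] += 1
    return {d: correct[d] / total[d] for d in total}


if __name__ == "__main__":
    # Toy pool standing in for the aggregated, domain-classified repository.
    pool = [
        BenchItem("toy-math", "math", "2 + 2 = ?", "4"),
        BenchItem("toy-code", "code", "Which Python keyword defines a function?", "def"),
    ]
    echo_model = lambda q: "4"  # placeholder for an actual LLM call
    print(evaluate_by_domain(pool, echo_model, domains={"math", "code"}))
```

Reporting scores per domain rather than as a single aggregate is what exposes the performance gaps across domain-specific subsets that the paper highlights.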