

BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation

May 31, 2025
作者: Eunsu Kim, Haneul Yoo, Guijin Son, Hitesh Patel, Amit Agarwal, Alice Oh
cs.AI

Abstract

As large language models (LLMs) continue to advance, the need for up-to-date and well-organized benchmarks becomes increasingly critical. However, many existing datasets are scattered, difficult to manage, and make it challenging to perform evaluations tailored to specific needs or domains, despite the growing importance of domain-specific models in areas such as math or code. In this paper, we introduce BenchHub, a dynamic benchmark repository that empowers researchers and developers to evaluate LLMs more effectively. BenchHub aggregates and automatically classifies benchmark datasets from diverse domains, integrating 303K questions across 38 benchmarks. It is designed to support continuous updates and scalable data management, enabling flexible and customizable evaluation tailored to various domains or use cases. Through extensive experiments with various LLM families, we demonstrate that model performance varies significantly across domain-specific subsets, emphasizing the importance of domain-aware benchmarking. We believe BenchHub can encourage better dataset reuse, more transparent model comparisons, and easier identification of underrepresented areas in existing benchmarks, offering a critical infrastructure for advancing LLM evaluation research.
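To illustrate the kind of domain-aware, customizable evaluation the abstract describes, the following is a minimal, hypothetical Python sketch of filtering a unified pool of benchmark questions by domain tags and sampling a custom evaluation subset. The field names (`domain`, `source_benchmark`) and the `build_eval_subset` helper are illustrative assumptions, not BenchHub's actual interface.

```python
# Hypothetical sketch: assembling a custom, domain-aware evaluation subset
# from a unified pool of benchmark questions. Field names and helpers are
# illustrative assumptions, not BenchHub's actual API.
import random
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    question: str
    answer: str
    source_benchmark: str  # e.g. "MMLU", "GSM8K"
    domain: str            # e.g. "math", "code", "knowledge"

def build_eval_subset(pool, domains, n_per_domain, seed=0):
    """Sample up to n_per_domain items for each requested domain."""
    rng = random.Random(seed)
    subset = []
    for d in domains:
        candidates = [item for item in pool if item.domain == d]
        rng.shuffle(candidates)
        subset.extend(candidates[:n_per_domain])
    return subset

# Toy pool standing in for an aggregated repository of benchmark questions.
pool = [
    BenchmarkItem("What is 7 * 8?", "56", "GSM8K", "math"),
    BenchmarkItem("Reverse a string in Python.", "s[::-1]", "HumanEval", "code"),
    BenchmarkItem("What is the capital of Korea?", "Seoul", "MMLU", "knowledge"),
]

custom_eval = build_eval_subset(pool, domains=["math", "code"], n_per_domain=1)
for item in custom_eval:
    print(item.source_benchmark, "|", item.domain, "|", item.question)
```

Under this kind of setup, per-domain scores can be reported separately, which is what makes the abstract's observation visible: the same model can rank very differently on a math-only subset than on a knowledge-heavy one.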