Benchmark^2: Systematic Evaluation of LLM Benchmarks
January 7, 2026
Authors: Qi Qian, Chengsong Huang, Jingwen Xu, Changze Lv, Muling Wu, Wenhao Liu, Xiaohua Wang, Zhenghua Wang, Zisu Huang, Muzhao Tian, Jianhan Xu, Kun Hu, He-Da Wang, Yao Hu, Xuanjing Huang, Xiaoqing Zheng
cs.AI
Abstract
The rapid proliferation of benchmarks for evaluating large language models (LLMs) has created an urgent need for systematic methods to assess benchmark quality itself. We propose Benchmark^2, a comprehensive framework comprising three complementary metrics: (1) Cross-Benchmark Ranking Consistency, measuring whether a benchmark produces model rankings aligned with peer benchmarks; (2) Discriminability Score, quantifying a benchmark's ability to differentiate between models; and (3) Capability Alignment Deviation, identifying problematic instances where stronger models fail but weaker models succeed within the same model family. We conduct extensive experiments across 15 benchmarks spanning mathematics, reasoning, and knowledge domains, evaluating 11 LLMs across four model families. Our analysis reveals significant quality variations among existing benchmarks and demonstrates that selective benchmark construction based on our metrics can achieve comparable evaluation performance with substantially reduced test sets.
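The three metrics are only named in this abstract, so the following Python sketch is a minimal, assumed formulation rather than the authors' implementation: ranking consistency is taken as an average Kendall's tau against peer benchmarks, discriminability as the spread of per-model scores, and capability alignment deviation as the fraction of within-family (weaker, stronger) pairs where the weaker model answers an item correctly while the stronger model does not. The function names, formulas, and data layout below are illustrative assumptions.

# Illustrative sketch only: plausible stand-ins for the three metrics,
# not the paper's definitions.
from itertools import combinations
from statistics import mean, pstdev
from scipy.stats import kendalltau  # rank correlation between model orderings


def ranking_consistency(target_scores, peer_scores_list):
    """Average Kendall's tau between the target benchmark's model scores
    and each peer benchmark's scores over the same models (assumed form)."""
    taus = []
    for peer_scores in peer_scores_list:
        tau, _ = kendalltau(target_scores, peer_scores)
        taus.append(tau)
    return mean(taus)


def discriminability(model_scores):
    """Spread of per-model accuracies on one benchmark; a larger spread
    means the benchmark separates models more clearly (assumed proxy)."""
    return pstdev(model_scores)


def capability_alignment_deviation(item_correct, family_order):
    """Fraction of (weaker, stronger, item) triples within one model family
    where the stronger model fails an item the weaker model solves.
    item_correct[m][i] is True if model m answers item i correctly;
    family_order lists the family's models from weakest to strongest."""
    n_items = len(next(iter(item_correct.values())))
    violations, total = 0, 0
    for weaker, stronger in combinations(family_order, 2):
        for i in range(n_items):
            total += 1
            if item_correct[weaker][i] and not item_correct[stronger][i]:
                violations += 1
    return violations / total if total else 0.0

Here item_correct would map each model name to a list of per-item booleans and family_order would list one family's models from weakest to strongest; this bookkeeping is a placeholder for whatever representation the paper actually uses.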