A Survey on Large Language Model Benchmarks
August 21, 2025
Authors: Shiwen Ni, Guhong Chen, Shuaimin Li, Xuanang Chen, Siyi Li, Bingli Wang, Qiyao Wang, Xingjian Wang, Yifan Zhang, Liyang Fan, Chengming Li, Ruifeng Xu, Le Sun, Min Yang
cs.AI
Abstract
In recent years, as large language models have rapidly grown in both the depth
and breadth of their capabilities, a corresponding profusion of evaluation
benchmarks has emerged. As quantitative tools for assessing model performance,
benchmarks are not only the core means of measuring model capabilities but also
a key force in guiding the direction of model development and driving
technological innovation. We present the first systematic review of the current
state and evolution of large language model benchmarks, categorizing 283
representative benchmarks into three groups: general capabilities,
domain-specific, and target-specific. General-capability benchmarks cover
aspects such as core linguistics, knowledge, and reasoning; domain-specific
benchmarks focus on fields such as the natural sciences, the humanities and
social sciences, and engineering technology; target-specific benchmarks address
concerns such as risk, reliability, and agents. We point out that current
benchmarks suffer from inflated scores caused by data contamination, unfair
evaluation arising from cultural and linguistic biases, and a lack of
evaluation of process credibility and dynamic environments, and we offer a
referable design paradigm for future benchmark innovation.