

A Survey on Large Language Model Benchmarks

August 21, 2025
Authors: Shiwen Ni, Guhong Chen, Shuaimin Li, Xuanang Chen, Siyi Li, Bingli Wang, Qiyao Wang, Xingjian Wang, Yifan Zhang, Liyang Fan, Chengming Li, Ruifeng Xu, Le Sun, Min Yang
cs.AI

Abstract

In recent years, as the depth and breadth of large language models' capabilities have rapidly expanded, a corresponding wave of evaluation benchmarks has emerged. As quantitative tools for assessing model performance, benchmarks are not only the core means of measuring model capabilities but also a key factor in guiding the direction of model development and driving technological innovation. We present the first systematic review of the current state and evolution of large language model benchmarks, categorizing 283 representative benchmarks into three groups: general capability, domain-specific, and target-specific. General capability benchmarks cover core linguistics, knowledge, and reasoning; domain-specific benchmarks focus on fields such as the natural sciences, humanities and social sciences, and engineering; target-specific benchmarks address risk, reliability, agents, and related dimensions. We identify key problems with current benchmarks, including score inflation caused by data contamination, unfair evaluation arising from cultural and linguistic biases, and the lack of evaluation of process credibility and dynamic environments, and we offer a design paradigm to inform future benchmark innovation.