

A Survey on Large Language Model Benchmarks

August 21, 2025
Authors: Shiwen Ni, Guhong Chen, Shuaimin Li, Xuanang Chen, Siyi Li, Bingli Wang, Qiyao Wang, Xingjian Wang, Yifan Zhang, Liyang Fan, Chengming Li, Ruifeng Xu, Le Sun, Min Yang
cs.AI

Abstract

In recent years, as the depth and breadth of large language models' capabilities have expanded rapidly, evaluation benchmarks have proliferated accordingly. As quantitative tools for assessing model performance, benchmarks are not only a core means of measuring model capabilities but also a key driver in guiding model development and promoting technological innovation. We present the first systematic review of the current state and evolution of large language model benchmarks, categorizing 283 representative benchmarks into three classes: general capabilities, domain-specific, and target-specific. General-capability benchmarks cover core linguistics, knowledge, and reasoning; domain-specific benchmarks focus on fields such as the natural sciences, humanities and social sciences, and engineering and technology; target-specific benchmarks address risks, reliability, agents, and related concerns. We identify several problems with current benchmarks, including score inflation caused by data contamination, unfair evaluation arising from cultural and linguistic biases, and a lack of evaluation of process credibility and performance in dynamic environments, and we propose a reference design paradigm for future benchmark innovation.
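
As a rough illustration, the survey's three-way taxonomy can be encoded as a simple lookup structure. The Python sketch below is hypothetical (the TAXONOMY mapping and subcategories_of helper are not from the paper), and its subcategory lists include only the examples named in the abstract.

    # Hypothetical sketch of the survey's three-way benchmark taxonomy.
    # The mapping lists only the example subcategories named in the
    # abstract; the survey itself organizes 283 benchmarks in finer detail.
    TAXONOMY: dict[str, list[str]] = {
        "general capabilities": ["core linguistics", "knowledge", "reasoning"],
        "domain-specific": [
            "natural sciences",
            "humanities and social sciences",
            "engineering and technology",
        ],
        "target-specific": ["risks", "reliability", "agents"],
    }

    def subcategories_of(category: str) -> list[str]:
        """Return the example subcategories for a top-level category."""
        return TAXONOMY.get(category, [])

    if __name__ == "__main__":
        for category, examples in TAXONOMY.items():
            print(f"{category}: {', '.join(examples)}")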