A Survey on Evaluation of Large Language Models
July 6, 2023
Authors: Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Kaijie Zhu, Hao Chen, Linyi Yang, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, Xing Xie
cs.AI
Abstract
Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level but also at the societal level, for a better understanding of their potential risks. Over the past few years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. First, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas. Second, we answer the 'where' and 'how' questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing the performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLM evaluation. Our aim is to offer valuable insights to researchers in the realm of LLM evaluation, thereby aiding the development of more proficient LLMs. Our key point is that evaluation should be treated as an essential discipline to better assist the development of LLMs. We consistently maintain the related open-source materials at: https://github.com/MLGroupJLU/LLM-eval-survey.