A Survey on Evaluation of Large Language Models
July 6, 2023
Authors: Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Kaijie Zhu, Hao Chen, Linyi Yang, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, Xing Xie
cs.AI
Abstract
Large language models (LLMs) are gaining increasing popularity in both
academia and industry, owing to their unprecedented performance in various
applications. As LLMs continue to play a vital role in both research and daily
use, their evaluation becomes increasingly critical, not only at the task
level but also at the societal level, to better understand their
potential risks. Over the past few years, significant efforts have been made to
examine LLMs from various perspectives. This paper presents a comprehensive
review of these evaluation methods for LLMs, focusing on three key dimensions:
what to evaluate, where to evaluate, and how to evaluate. Firstly, we provide
an overview from the perspective of evaluation tasks, encompassing general
natural language processing tasks, reasoning, medical usage, ethics,
education, natural and social sciences, agent applications, and other areas.
Secondly, we answer the 'where' and 'how' questions by diving into the
evaluation methods and benchmarks, which serve as crucial components in
assessing the performance of LLMs. Then, we summarize the success and failure cases
of LLMs in different tasks. Finally, we shed light on several future challenges
that lie ahead in LLM evaluation. Our aim is to offer valuable insights to
researchers in the realm of LLM evaluation, thereby aiding the development of
more proficient LLMs. Our key point is that evaluation should be treated as an
essential discipline to better assist the development of LLMs. We consistently
maintain the related open-source materials at:
https://github.com/MLGroupJLU/LLM-eval-survey.