A Survey on Evaluation of Large Language Models
July 6, 2023
Authors: Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Kaijie Zhu, Hao Chen, Linyi Yang, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, Xing Xie
cs.AI
Abstract
Large language models (LLMs) are gaining increasing popularity in both
academia and industry, owing to their unprecedented performance in various
applications. As LLMs continue to play a vital role in both research and daily
use, their evaluation becomes increasingly critical, not only at the task
level but also at the societal level, to better understand their
potential risks. Over the past few years, significant efforts have been made to
examine LLMs from various perspectives. This paper presents a comprehensive
review of these evaluation methods for LLMs, focusing on three key dimensions:
what to evaluate, where to evaluate, and how to evaluate. Firstly, we provide
an overview from the perspective of evaluation tasks, encompassing general
natural language processing tasks, reasoning, medical usage, ethics,
education, natural and social sciences, agent applications, and other areas.
Secondly, we answer the 'where' and 'how' questions by diving into the
evaluation methods and benchmarks, which serve as crucial components in
assessing the performance of LLMs. Then, we summarize the success and failure cases
of LLMs in different tasks. Finally, we shed light on several future challenges
that lie ahead in LLM evaluation. Our aim is to offer valuable insights to
researchers in the realm of LLM evaluation, thereby aiding the development of
more proficient LLMs. Our key point is that evaluation should be treated as an
essential discipline to better assist the development of LLMs. We consistently
maintain the related open-source materials at:
https://github.com/MLGroupJLU/LLM-eval-survey.