A Survey on Evaluation of Large Language Models

Abstract

Large language models (LLMs) are gaining popularity in both academia and industry, owing to their unprecedented performance in a wide range of applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level but also at the societal level, where it supports a better understanding of their potential risks. Over the past few years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. First, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas. Second, we answer the 'where' and 'how' questions by examining the evaluation methods and benchmarks that serve as crucial components in assessing the performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLM evaluation. Our aim is to offer valuable insights to researchers in the field of LLM evaluation, thereby aiding the development of more proficient LLMs. Our key point is that evaluation should be treated as an essential discipline to better assist the development of LLMs. We continuously maintain the related open-source materials at: https://github.com/MLGroupJLU/LLM-eval-survey.
