A Survey on Evaluation of Large Language Models

Abstract

Large language models (LLMs) are gaining popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level but also at the societal level, where it supports a better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. First, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas. Second, we answer the 'where' and 'how' questions by examining the evaluation methods and benchmarks, which serve as crucial components in assessing the performance of LLMs. We then summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLM evaluation. Our aim is to offer valuable insights to researchers in the field of LLM evaluation, thereby aiding the development of more proficient LLMs. Our key point is that evaluation should be treated as an essential discipline to better assist the development of LLMs. We maintain the related open-source materials at: https://github.com/MLGroupJLU/LLM-eval-survey.
