

MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks

November 13, 2023
作者: Sanchit Ahuja, Divyanshu Aggarwal, Varun Gumma, Ishaan Watts, Ashutosh Sathe, Millicent Ochieng, Rishav Hada, Prachi Jain, Maxamed Axmed, Kalika Bali, Sunayana Sitaram
cs.AI

Abstract

Recently, there has been a rapid advancement in research on Large Language Models (LLMs), resulting in significant progress in several Natural Language Processing (NLP) tasks. Consequently, there has been a surge in LLM evaluation research to comprehend the models' capabilities and limitations. However, much of this research has been confined to the English language, leaving LLM building and evaluation for non-English languages relatively unexplored. Several new LLMs have been introduced recently, necessitating their evaluation on non-English languages. This study aims to expand our MEGA benchmarking suite by including six new datasets to form the MEGAVERSE benchmark. The benchmark comprises 22 datasets covering 81 languages, including low-resource African languages. We evaluate several state-of-the-art LLMs, such as GPT-3.5-Turbo, GPT4, PaLM2, and Llama2, on the MEGAVERSE datasets. Additionally, we include two multimodal datasets in the benchmark and assess the performance of the LLaVa-v1.5 model. Our experiments suggest that GPT4 and PaLM2 outperform the Llama models on various tasks, notably on low-resource languages, with GPT4 outperforming PaLM2 on more datasets than vice versa. However, issues such as data contamination must be addressed to obtain an accurate assessment of LLM performance on non-English languages.