MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks
November 13, 2023
Authors: Sanchit Ahuja, Divyanshu Aggarwal, Varun Gumma, Ishaan Watts, Ashutosh Sathe, Millicent Ochieng, Rishav Hada, Prachi Jain, Maxamed Axmed, Kalika Bali, Sunayana Sitaram
cs.AI
Abstract
Recently, there has been a rapid advancement in research on Large Language
Models (LLMs), resulting in significant progress in several Natural Language
Processing (NLP) tasks. Consequently, there has been a surge in LLM evaluation
research to comprehend the models' capabilities and limitations. However, much
of this research has been confined to English, leaving the building and
evaluation of LLMs for non-English languages relatively unexplored. Several new
LLMs have been introduced recently, necessitating their evaluation on
non-English languages. This study aims to expand our MEGA benchmarking suite by
including six new datasets to form the MEGAVERSE benchmark. The benchmark
comprises 22 datasets covering 81 languages, including low-resource African
languages. We evaluate several state-of-the-art LLMs like GPT-3.5-Turbo, GPT4,
PaLM2, and Llama2 on the MEGAVERSE datasets. Additionally, we include two
multimodal datasets in the benchmark and assess the performance of the
LLaVa-v1.5 model. Our experiments suggest that GPT4 and PaLM2 outperform the
Llama models on various tasks, notably on low-resource languages, with GPT4
outperforming PaLM2 on more datasets than vice versa. However, issues such as
data contamination must be addressed to obtain an accurate assessment of LLM
performance on non-English languages.