MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks
November 13, 2023
Authors: Sanchit Ahuja, Divyanshu Aggarwal, Varun Gumma, Ishaan Watts, Ashutosh Sathe, Millicent Ochieng, Rishav Hada, Prachi Jain, Maxamed Axmed, Kalika Bali, Sunayana Sitaram
cs.AI
Abstract
Recently, there has been a rapid advancement in research on Large Language
Models (LLMs), resulting in significant progress in several Natural Language
Processing (NLP) tasks. Consequently, there has been a surge in LLM evaluation
research to comprehend the models' capabilities and limitations. However, much
of this research has been confined to English, leaving the building and
evaluation of LLMs for non-English languages relatively unexplored. Several new
LLMs have been introduced recently, necessitating their evaluation on
non-English languages. This study aims to expand our MEGA benchmarking suite by
including six new datasets to form the MEGAVERSE benchmark. The benchmark
comprises 22 datasets covering 81 languages, including low-resource African
languages. We evaluate several state-of-the-art LLMs like GPT-3.5-Turbo, GPT4,
PaLM2, and Llama2 on the MEGAVERSE datasets. Additionally, we include two
multimodal datasets in the benchmark and assess the performance of the
LLaVa-v1.5 model. Our experiments suggest that GPT4 and PaLM2 outperform the
Llama models on various tasks, notably on low-resource languages, with GPT4
outperforming PaLM2 on more datasets than vice versa. However, issues such as
data contamination must be addressed to obtain an accurate assessment of LLM
performance on non-English languages.