MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks

Abstract

Research on Large Language Models (LLMs) has advanced rapidly in recent years, yielding significant progress on several Natural Language Processing (NLP) tasks. This has prompted a surge in LLM evaluation research aimed at understanding the models' capabilities and limitations. However, much of this research has been confined to English, leaving LLM building and evaluation for non-English languages relatively unexplored. Several new LLMs have since been introduced, necessitating their evaluation on non-English languages. This study aims to expand our MEGA benchmarking suite with six new datasets to form the MEGAVERSE benchmark. The benchmark comprises 22 datasets covering 81 languages, including low-resource African languages. We evaluate several state-of-the-art LLMs, such as GPT-3.5-Turbo, GPT4, PaLM2, and Llama2, on the MEGAVERSE datasets. Additionally, we include two multimodal datasets in the benchmark and assess the performance of the LLaVa-v1.5 model. Our experiments suggest that GPT4 and PaLM2 outperform the Llama models on various tasks, notably on low-resource languages, with GPT4 outperforming PaLM2 on more datasets than vice versa. However, issues such as data contamination must be addressed to obtain an accurate assessment of LLM performance on non-English languages.
