The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants

August 31, 2023
Authors: Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, Madian Khabsa
cs.AI

Abstract

We present Belebele, a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. Significantly expanding the language coverage of natural language understanding (NLU) benchmarks, this dataset enables the evaluation of text models in high-, medium-, and low-resource languages. Each question is based on a short passage from the Flores-200 dataset and has four multiple-choice answers. The questions were carefully curated to discriminate between models with different levels of general language comprehension. The English dataset on its own proves difficult enough to challenge state-of-the-art language models. Being fully parallel, this dataset enables direct comparison of model performance across all languages. We use this dataset to evaluate the capabilities of multilingual masked language models (MLMs) and large language models (LLMs). We present extensive results and find that despite significant cross-lingual transfer in English-centric LLMs, much smaller MLMs pretrained on balanced multilingual data still understand far more languages. We also observe that larger vocabulary size and conscious vocabulary construction correlate with better performance on low-resource languages. Overall, Belebele opens up new avenues for evaluating and analyzing the multilingual capabilities of NLP systems.
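
Because every question has four answer choices and the dataset is fully parallel, per-language accuracy can be compared directly against a 25% random-guess baseline. The following is a minimal sketch of that comparison, not the authors' evaluation code: the `MRCItem` fields, the `accuracy_by_language` helper, and the `predict` callback are hypothetical names introduced for illustration, and the language codes are only examples in the Flores-200 style.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Sequence


@dataclass(frozen=True)
class MRCItem:
    """One multiple-choice question (field names are illustrative assumptions)."""
    lang: str                # language-variant code, e.g. "eng_Latn"
    passage: str             # short passage the question is based on
    question: str
    choices: Sequence[str]   # exactly four answer options
    answer_index: int        # 0-based index of the correct option


# Expected accuracy of uniform random guessing over four choices.
RANDOM_BASELINE = 1 / 4


def accuracy_by_language(
    items: Iterable[MRCItem],
    predict: Callable[[MRCItem], int],
) -> Dict[str, float]:
    """Return the fraction of correctly answered questions per language."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for item in items:
        if len(item.choices) != 4:
            raise ValueError("expected exactly four answer choices")
        total[item.lang] += 1
        if predict(item) == item.answer_index:
            correct[item.lang] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


if __name__ == "__main__":
    # Toy items standing in for parallel questions in two language variants.
    toy = [
        MRCItem("eng_Latn", "passage", "question", ["a", "b", "c", "d"], 0),
        MRCItem("fra_Latn", "passage", "question", ["a", "b", "c", "d"], 0),
    ]
    # A placeholder "model" that always picks the first option.
    scores = accuracy_by_language(toy, lambda item: 0)
    for lang, acc in sorted(scores.items()):
        print(f"{lang}: {acc:.2f} (random baseline {RANDOM_BASELINE:.2f})")
```

Since the questions are parallel, a gap between two languages' scores under the same `predict` function reflects the model's handling of those languages rather than a difference in question difficulty.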
