The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants

August 31, 2023
Authors: Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, Madian Khabsa
cs.AI

Abstract

We present Belebele, a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. Significantly expanding the language coverage of natural language understanding (NLU) benchmarks, this dataset enables the evaluation of text models in high-, medium-, and low-resource languages. Each question is based on a short passage from the Flores-200 dataset and has four multiple-choice answers. The questions were carefully curated to discriminate between models with different levels of general language comprehension. The English dataset on its own proves difficult enough to challenge state-of-the-art language models. Being fully parallel, this dataset enables direct comparison of model performance across all languages. We use this dataset to evaluate the capabilities of multilingual masked language models (MLMs) and large language models (LLMs). We present extensive results and find that despite significant cross-lingual transfer in English-centric LLMs, much smaller MLMs pretrained on balanced multilingual data still understand far more languages. We also observe that larger vocabulary size and conscious vocabulary construction correlate with better performance on low-resource languages. Overall, Belebele opens up new avenues for evaluating and analyzing the multilingual capabilities of NLP systems.
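The abstract describes each item as a short passage, a question, and four answer choices, with the same items available in every language so that scores can be compared directly. A minimal sketch of that layout and a per-language accuracy computation might look like the following. The class name, field names, the `accuracy_by_language` helper, and the language codes in the usage note are illustrative assumptions, not the dataset's actual schema or the authors' evaluation code.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class MRCItem:
    """One multiple-choice reading comprehension item (hypothetical schema)."""
    item_id: str    # assumed: shared by all translations of the same question
    language: str   # assumed: a language-variant code
    passage: str
    question: str
    answers: Tuple[str, str, str, str]  # exactly four choices, per the abstract
    correct: int    # index (0-3) of the correct choice


def accuracy_by_language(
    items: Iterable[MRCItem],
    predictions: Dict[Tuple[str, str], int],
) -> Dict[str, float]:
    """Return accuracy per language.

    `predictions` maps (item_id, language) to the predicted answer index.
    A missing prediction counts as wrong.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for item in items:
        totals[item.language] += 1
        if predictions.get((item.item_id, item.language)) == item.correct:
            hits[item.language] += 1
    return {lang: hits[lang] / totals[lang] for lang in totals}
```

Because the items are parallel, the same `item_id` appears once per language, so the resulting accuracies are computed over identical questions and can be compared side by side, for example `accuracy_by_language(items, preds)["eng_Latn"]` against the value for another variant.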
