
MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment

October 8, 2024
Authors: Amir Hossein Kargaran, Ali Modarressi, Nafiseh Nikeghbal, Jana Diesner, François Yvon, Hinrich Schütze
cs.AI

Abstract

English-centric large language models (LLMs) often show strong multilingual capabilities. However, the multilingual performance of these models remains unclear and is not thoroughly evaluated for many languages. Most benchmarks for multilinguality focus on classic NLP tasks, or cover a minimal number of languages. We introduce MEXA, a method for assessing the multilingual capabilities of pre-trained English-centric LLMs using parallel sentences, which are available for more languages than existing downstream tasks. MEXA leverages the fact that English-centric LLMs use English as a kind of pivot language in their intermediate layers. It computes the alignment between English and non-English languages using parallel sentences to evaluate the transfer of language understanding from English to other languages. This alignment can be used to estimate model performance in other languages. We conduct studies using various parallel datasets (FLORES-200 and Bible), models (Llama family, Gemma family, Mistral, and OLMo), and established downstream tasks (Belebele, m-MMLU, and m-ARC). We explore different methods to compute embeddings in decoder-only models. Our results show that MEXA, in its default settings, achieves a statistically significant average Pearson correlation of 0.90 with three established downstream tasks across nine models and two parallel datasets. This suggests that MEXA is a reliable method for estimating the multilingual capabilities of English-centric LLMs, providing a clearer understanding of their multilingual potential and the inner workings of LLMs. Leaderboard: https://huggingface.co/spaces/cis-lmu/Mexa, Code: https://github.com/cisnlp/Mexa.
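To make the alignment idea concrete, below is a minimal sketch of how a layer-wise cross-lingual alignment score could be computed for a decoder-only model: pool an intermediate layer's hidden states into sentence embeddings for parallel English and target-language sentences, then check how often the true translation is the nearest neighbour by cosine similarity. This is only an illustration in the spirit of MEXA, not the authors' implementation (see https://github.com/cisnlp/Mexa for that); the function names, the mean-pooling choice, the retrieval-style scoring, and the example model/layer are assumptions.

```python
# Illustrative sketch of a MEXA-style cross-lingual alignment score.
# NOT the authors' implementation; pooling, scoring, and names below are assumptions.

import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def mean_pool_layer(model, tokenizer, sentences, layer, device="cpu"):
    """Mean-pool the token hidden states of one intermediate layer into sentence embeddings."""
    model.to(device).eval()
    embeddings = []
    with torch.no_grad():
        for sent in sentences:
            inputs = tokenizer(sent, return_tensors="pt").to(device)
            # hidden_states is a tuple over layers; pick the requested intermediate layer.
            hidden = model(**inputs, output_hidden_states=True).hidden_states[layer]
            embeddings.append(hidden[0].mean(dim=0).cpu().numpy())
    return np.stack(embeddings)


def alignment_score(eng_emb, tgt_emb):
    """Fraction of English sentences whose nearest target-language neighbour
    (by cosine similarity) is their actual parallel translation."""
    eng = eng_emb / np.linalg.norm(eng_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = eng @ tgt.T  # pairwise cosine similarities over the parallel corpus
    return float((sims.argmax(axis=1) == np.arange(len(eng))).mean())


# Hypothetical usage with a small parallel sample and an arbitrary middle layer:
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# eng = mean_pool_layer(model, tokenizer, english_sentences, layer=16)
# deu = mean_pool_layer(model, tokenizer, german_sentences, layer=16)
# print(alignment_score(eng, deu))  # closer to 1.0 = stronger alignment with English
```

Scores of this kind, computed per language from parallel data such as FLORES-200 or the Bible, are what the paper correlates with downstream benchmark accuracy (Belebele, m-MMLU, m-ARC) to validate the method.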
