MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment
October 8, 2024
作者: Amir Hossein Kargaran, Ali Modarressi, Nafiseh Nikeghbal, Jana Diesner, François Yvon, Hinrich Schütze
cs.AI
Abstract
English-centric large language models (LLMs) often show strong multilingual
capabilities. However, the multilingual performance of these models remains
unclear and is not thoroughly evaluated for many languages. Most benchmarks for
multilinguality focus on classic NLP tasks, or cover a minimal number of
languages. We introduce MEXA, a method for assessing the multilingual
capabilities of pre-trained English-centric LLMs using parallel sentences,
which are available for more languages than existing downstream tasks. MEXA
leverages the fact that English-centric LLMs use English as a kind of pivot
language in their intermediate layers. It computes the alignment between
English and non-English languages using parallel sentences to evaluate the
transfer of language understanding from English to other languages. This
alignment can be used to estimate model performance in other languages. We
conduct studies using various parallel datasets (FLORES-200 and Bible), models
(Llama family, Gemma family, Mistral, and OLMo), and established downstream
tasks (Belebele, m-MMLU, and m-ARC). We explore different methods to compute
embeddings in decoder-only models. Our results show that MEXA, in its default
settings, achieves a statistically significant average Pearson correlation of
0.90 with three established downstream tasks across nine models and two
parallel datasets. This suggests that MEXA is a reliable method for estimating
the multilingual capabilities of English-centric LLMs, providing a clearer
understanding of their multilingual potential and the inner workings of LLMs.
Leaderboard: https://huggingface.co/spaces/cis-lmu/Mexa, Code:
https://github.com/cisnlp/Mexa.
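The alignment idea described in the abstract can be illustrated with a small sketch. The function below computes a toy alignment score from sentence embeddings of English/non-English parallel sentences: the fraction of sentence pairs whose cosine similarity dominates its row and column of the similarity matrix. This is only an assumed, simplified formulation for illustration; the exact MEXA score, and how sentence embeddings are extracted from a decoder-only LLM's intermediate layers, follow the paper and code linked above.

```python
import numpy as np

def alignment_score(eng_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """Toy MEXA-style alignment: fraction of parallel pairs (i, i) whose
    cosine similarity is the maximum of both row i and column i of the
    English-vs-target similarity matrix. A sketch, not the paper's exact
    formulation."""
    eng = eng_emb / np.linalg.norm(eng_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = eng @ tgt.T                      # (n, n) cosine similarities
    diag = np.diag(sim)
    row_best = diag >= sim.max(axis=1)     # translation beats all others in row
    col_best = diag >= sim.max(axis=0)     # and in column
    return float(np.mean(row_best & col_best))

# Synthetic embeddings: a well-aligned language (translations are small
# perturbations of the English embeddings) vs. an unrelated one.
rng = np.random.default_rng(0)
eng = rng.normal(size=(50, 16))
tgt_aligned = eng + 0.05 * rng.normal(size=(50, 16))
tgt_random = rng.normal(size=(50, 16))
print(alignment_score(eng, tgt_aligned))  # close to 1.0
print(alignment_score(eng, tgt_random))   # near 0.0
```

In the paper's setup, scores like this one, computed per language from FLORES-200 or Bible parallel sentences, are what get correlated (Pearson) with downstream task accuracy on Belebele, m-MMLU, and m-ARC.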