MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment

Abstract

English-centric large language models (LLMs) often show strong multilingual capabilities. However, the multilingual performance of these models remains unclear and has not been thoroughly evaluated for many languages. Most benchmarks for multilinguality either focus on classic NLP tasks or cover only a small number of languages. We introduce MEXA, a method for assessing the multilingual capabilities of pre-trained English-centric LLMs using parallel sentences, which are available for more languages than existing downstream tasks. MEXA leverages the fact that English-centric LLMs use English as a kind of pivot language in their intermediate layers. It computes the alignment between English and non-English languages using parallel sentences to evaluate the transfer of language understanding from English to other languages. This alignment can be used to estimate model performance in other languages. We conduct studies using various parallel datasets (FLORES-200 and Bible), models (Llama family, Gemma family, Mistral, and OLMo), and established downstream tasks (Belebele, m-MMLU, and m-ARC). We explore different methods to compute embeddings in decoder-only models. Our results show that MEXA, in its default settings, achieves a statistically significant average Pearson correlation of 0.90 with three established downstream tasks across nine models and two parallel datasets. This suggests that MEXA is a reliable method for estimating the multilingual capabilities of English-centric LLMs, providing a clearer understanding of their multilingual potential and the inner workings of LLMs. Leaderboard: https://huggingface.co/spaces/cis-lmu/Mexa, Code: https://github.com/cisnlp/Mexa.
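
The abstract does not spell out how the alignment is computed, so the following is only a minimal sketch under stated assumptions, not the authors' implementation (see the linked repository for that). It assumes each parallel sentence already has one embedding per language, taken from some intermediate layer of the model, and that a sentence pair counts as aligned when the cosine similarity of the true pair is higher than the similarity to every other sentence in the set. The function names `alignment_score` and `pearson`, and the toy vectors in the usage note, are hypothetical and chosen for illustration only.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def alignment_score(
    en_emb: Sequence[Sequence[float]],
    xx_emb: Sequence[Sequence[float]],
) -> float:
    """Fraction of parallel pairs whose true pair is the best match.

    ASSUMPTION: the retrieval-style criterion below is an illustrative
    choice; the abstract only says that an alignment is computed.
    en_emb[i] and xx_emb[i] embed the same sentence in English and in
    the other language.
    """
    if len(en_emb) != len(xx_emb):
        raise ValueError("need the same number of sentences per language")
    n = len(en_emb)
    if n == 0:
        raise ValueError("need at least one parallel sentence")

    # sim[i][j]: similarity of English sentence i to target sentence j.
    sim = [[cosine(e, x) for x in xx_emb] for e in en_emb]

    aligned = 0
    for i in range(n):
        # Competing (non-parallel) pairs in row i and column i.
        rivals = [sim[i][j] for j in range(n) if j != i]
        rivals += [sim[j][i] for j in range(n) if j != i]
        if not rivals or sim[i][i] > max(rivals):
            aligned += 1
    return aligned / n


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation, e.g. between per-language alignment scores
    and per-language downstream task accuracies."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with >= 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

With toy two-dimensional vectors, `alignment_score([[1, 0], [0, 1]], [[0.9, 0.1], [0.2, 0.8]])` gives 1.0 because each English sentence is closest to its own translation, while swapping the two target vectors gives 0.0. In a real setting, one such score per language would be computed and then passed to `pearson` together with that model's downstream task results for the same languages.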
