MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment

Abstract

English-centric large language models (LLMs) often show strong multilingual capabilities. However, the multilingual performance of these models remains unclear and has not been thoroughly evaluated for many languages. Most multilinguality benchmarks either focus on classic NLP tasks or cover only a small number of languages. We introduce MEXA, a method for assessing the multilingual capabilities of pre-trained English-centric LLMs using parallel sentences, which are available for more languages than existing downstream tasks. MEXA leverages the fact that English-centric LLMs use English as a kind of pivot language in their intermediate layers. It computes the alignment between English and non-English languages using parallel sentences to evaluate the transfer of language understanding from English to other languages. This alignment can then be used to estimate model performance in those languages. We conduct studies using various parallel datasets (FLORES-200 and Bible), models (Llama family, Gemma family, Mistral, and OLMo), and established downstream tasks (Belebele, m-MMLU, and m-ARC). We also explore different methods for computing embeddings in decoder-only models. Our results show that MEXA, in its default settings, achieves a statistically significant average Pearson correlation of 0.90 with three established downstream tasks across nine models and two parallel datasets. This suggests that MEXA is a reliable method for estimating the multilingual capabilities of English-centric LLMs, providing a clearer understanding of their multilingual potential and of the inner workings of LLMs. Leaderboard: https://huggingface.co/spaces/cis-lmu/Mexa, Code: https://github.com/cisnlp/Mexa.
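
The abstract does not spell out how the alignment is scored, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes sentence embeddings have already been extracted from some intermediate layer as NumPy arrays whose rows are parallel (row i of the English array translates row i of the other array), and it assumes one plausible scoring rule: a pair counts as aligned when its cosine similarity beats every non-parallel pair in the same row and column. The names `alignment_score` and `pearson` are hypothetical and do not come from the MEXA codebase.

```python
import numpy as np


def alignment_score(en_emb: np.ndarray, xx_emb: np.ndarray) -> float:
    """Fraction of parallel pairs that out-score all non-parallel pairs.

    Assumed scoring rule (not confirmed by the abstract): pair i is aligned
    if cos(en_i, xx_i) is strictly greater than cos(en_i, xx_j) and
    cos(en_j, xx_i) for every j != i.
    """
    # L2-normalise rows so that the dot product equals cosine similarity.
    en = en_emb / np.linalg.norm(en_emb, axis=1, keepdims=True)
    xx = xx_emb / np.linalg.norm(xx_emb, axis=1, keepdims=True)

    sim = en @ xx.T               # sim[i, j] = cos(en_i, xx_j)
    diag = np.diag(sim).copy()    # similarities of the true parallel pairs

    # Mask the diagonal so row/column maxima cover non-parallel pairs only.
    off = sim.copy()
    np.fill_diagonal(off, -np.inf)
    aligned = (diag > off.max(axis=1)) & (diag > off.max(axis=0))
    return float(aligned.mean())


def pearson(x, y) -> float:
    """Pearson correlation between two equal-length score sequences."""
    return float(np.corrcoef(x, y)[0, 1])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    en = rng.normal(size=(100, 64))                   # toy "English" embeddings
    xx = en + rng.normal(scale=0.5, size=en.shape)    # toy noisy "translations"
    print("alignment:", alignment_score(en, xx))
```

In this framing, one would compute a score per language and then pass the per-language alignment scores and the matching downstream task accuracies to `pearson` to check how well alignment tracks task performance.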
