Multilingual Encoder Knows more than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages
February 15, 2025
Authors: Zeli Su, Ziyin Zhang, Guixian Xu, Jianing Liu, XU Han, Ting Zhang, Yushuang Dong
cs.AI
Abstract
While multilingual language models like XLM-R have advanced multilingualism
in NLP, they still perform poorly in extremely low-resource languages. This
situation is exacerbated by the fact that modern LLMs such as LLaMA and Qwen
support far fewer languages than XLM-R, making text generation models
non-existent for many languages in the world. To tackle this challenge, we
propose a novel framework for adapting multilingual encoders to text generation
in extremely low-resource languages. By reusing the weights between the encoder
and the decoder, our framework allows the model to leverage the learned
semantic space of the encoder, enabling efficient learning and effective
generalization in low-resource languages. Applying this framework to four
Chinese minority languages, we present XLM-SWCM, and demonstrate its superior
performance on various downstream tasks even when compared with much larger
models.
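The core idea of the abstract, reusing a pretrained multilingual encoder's weights to warm-start a decoder so that generation in low-resource languages starts from the encoder's learned semantic space, can be illustrated with a minimal sketch. This is not the authors' XLM-SWCM implementation; it uses Hugging Face's `EncoderDecoderModel` as an assumed stand-in for shared-weight initialization, and `xlm-roberta-base` as an illustrative checkpoint.

```python
# Minimal sketch (assumption: Hugging Face transformers; not the paper's released code)
# of warm-starting a seq2seq model by reusing an XLM-R-style encoder's weights
# for the decoder as well.
from transformers import AutoTokenizer, EncoderDecoderModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Encoder and decoder are both initialized from the same pretrained encoder
# checkpoint: decoder self-attention and feed-forward weights are copied from
# the encoder, while the cross-attention layers are newly initialized.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "xlm-roberta-base", "xlm-roberta-base"
)

# Special tokens required for generation with an encoder-decoder model.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id

# Toy generation on a single (placeholder) input sentence.
inputs = tokenizer("an example sentence in a low-resource language",
                   return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

In practice such a warm-started model would still be fine-tuned on the target low-resource languages; the sketch only shows why shared initialization gives the decoder access to the encoder's multilingual representations from the start.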