Multilingual Encoder Knows more than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages
February 15, 2025
Authors: Zeli Su, Ziyin Zhang, Guixian Xu, Jianing Liu, XU Han, Ting Zhang, Yushuang Dong
cs.AI
Abstract
While multilingual language models like XLM-R have advanced multilingualism
in NLP, they still perform poorly in extremely low-resource languages. This
situation is exacerbated by the fact that modern LLMs such as LLaMA and Qwen
support far fewer languages than XLM-R, making text generation models
non-existent for many languages in the world. To tackle this challenge, we
propose a novel framework for adapting multilingual encoders to text generation
in extremely low-resource languages. By reusing the weights between the encoder
and the decoder, our framework allows the model to leverage the learned
semantic space of the encoder, enabling efficient learning and effective
generalization in low-resource languages. Applying this framework to four
Chinese minority languages, we present XLM-SWCM, and demonstrate its superior
performance on various downstream tasks even when compared with much larger
models.
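The core idea of the abstract, reusing a pretrained multilingual encoder's weights to warm-start a decoder so that generation in low-resource languages starts from the encoder's learned semantic space, can be illustrated with a minimal sketch. This is not the authors' XLM-SWCM implementation; it uses Hugging Face's `EncoderDecoderModel` as an assumed stand-in for shared-weight initialization, and `xlm-roberta-base` as an illustrative checkpoint.

```python
# Minimal sketch (assumption: Hugging Face transformers; not the paper's released code)
# of warm-starting a seq2seq model by reusing an XLM-R-style encoder's weights
# for the decoder as well.
from transformers import AutoTokenizer, EncoderDecoderModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Encoder and decoder are both initialized from the same pretrained encoder
# checkpoint: decoder self-attention and feed-forward weights are copied from
# the encoder, while the cross-attention layers are newly initialized.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "xlm-roberta-base", "xlm-roberta-base"
)

# Special tokens required for generation with an encoder-decoder model.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id

# Toy generation on a single (placeholder) input sentence.
inputs = tokenizer("an example sentence in a low-resource language",
                   return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

In practice such a warm-started model would still be fine-tuned on the target low-resource languages; the sketch only shows why shared initialization gives the decoder access to the encoder's multilingual representations from the start.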