Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP

August 8, 2024
Authors: François Remy, Pieter Delobelle, Hayastan Avetisyan, Alfiya Khabibullina, Miryam de Lhoneux, Thomas Demeester
cs.AI

Abstract

The development of monolingual language models for low- and mid-resource languages continues to be hindered by the difficulty of sourcing high-quality training data. In this study, we present a novel cross-lingual vocabulary transfer strategy, trans-tokenization, designed to tackle this challenge and enable more efficient language adaptation. Our approach focuses on adapting a high-resource monolingual LLM to an unseen target language by initializing the token embeddings of the target language using a weighted average of semantically similar token embeddings from the source language. For this, we leverage a translation resource covering both the source and target languages. We validate our method with the Tweeties, a series of trans-tokenized LLMs, and demonstrate their competitive performance on various downstream tasks across a small but diverse set of languages. Additionally, we introduce Hydra LLMs, models with multiple swappable language modeling heads and embedding tables, which further extend the capabilities of our trans-tokenization strategy. By designing a Hydra LLM based on the multilingual model TowerInstruct, we developed a state-of-the-art machine translation model for Tatar, in a zero-shot manner, completely bypassing the need for high-quality parallel data. This breakthrough is particularly significant for low-resource languages like Tatar, where high-quality parallel data is hard to come by. By lowering the data and time requirements for training high-quality models, our trans-tokenization strategy allows for the development of LLMs for a wider range of languages, especially those with limited resources. We hope that our work will inspire further research and collaboration in the field of cross-lingual vocabulary transfer and contribute to the empowerment of languages on a global scale.
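To make the embedding-initialization step concrete, here is a minimal PyTorch sketch of the core idea from the abstract. The function name trans_tokenize_init, the token_map structure, and the mean-embedding fallback for unmapped tokens are illustrative assumptions, not the paper's actual implementation; in practice the (source token, weight) pairs would be derived from the translation resource covering both languages.

```python
import torch

def trans_tokenize_init(src_emb: torch.Tensor,
                        tgt_vocab: list[str],
                        token_map: dict[str, list[tuple[int, float]]]) -> torch.Tensor:
    """Initialize target-language token embeddings as weighted averages of
    semantically similar source-language token embeddings.

    src_emb   : (V_src, d) embedding matrix of the source monolingual LLM.
    tgt_vocab : vocabulary of the target-language tokenizer.
    token_map : hypothetical mapping from each target token to a list of
                (source token id, alignment weight) pairs obtained from a
                translation resource.
    """
    d = src_emb.size(1)
    tgt_emb = torch.empty(len(tgt_vocab), d)
    for i, tok in enumerate(tgt_vocab):
        pairs = token_map.get(tok, [])
        if pairs:
            ids = torch.tensor([src_id for src_id, _ in pairs])
            w = torch.tensor([wt for _, wt in pairs])
            w = w / w.sum()                       # normalize alignment weights
            tgt_emb[i] = (w.unsqueeze(1) * src_emb[ids]).sum(dim=0)
        else:
            tgt_emb[i] = src_emb.mean(dim=0)      # assumed fallback for unmapped tokens
    return tgt_emb

# Toy usage: pretend source ids 0 and 1 are "hello" and "hi", and the
# translation resource aligns the Tatar token "сәлам" to both.
src_emb = torch.randn(4, 8)
tgt_vocab = ["сәлам", "<unk>"]
token_map = {"сәлам": [(0, 0.7), (1, 0.3)]}
tgt_emb = trans_tokenize_init(src_emb, tgt_vocab, token_map)
print(tgt_emb.shape)  # torch.Size([2, 8])
```

In the Hydra LLM variant described above, several such embedding tables (with matching language modeling heads) would coexist around a shared transformer body and be swapped per language; the exact mechanics are specified in the paper.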
