Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP
August 8, 2024
Authors: François Remy, Pieter Delobelle, Hayastan Avetisyan, Alfiya Khabibullina, Miryam de Lhoneux, Thomas Demeester
cs.AI
Abstract
The development of monolingual language models for low and mid-resource
languages continues to be hindered by the difficulty in sourcing high-quality
training data. In this study, we present a novel cross-lingual vocabulary
transfer strategy, trans-tokenization, designed to tackle this challenge and
enable more efficient language adaptation. Our approach focuses on adapting a
high-resource monolingual LLM to an unseen target language by initializing the
token embeddings of the target language using a weighted average of
semantically similar token embeddings from the source language. For this, we
leverage a translation resource covering both the source and target languages.
We validate our method with the Tweeties, a series of trans-tokenized LLMs, and
demonstrate their competitive performance on various downstream tasks across a
small but diverse set of languages. Additionally, we introduce Hydra LLMs,
models with multiple swappable language modeling heads and embedding tables,
which further extend the capabilities of our trans-tokenization strategy. By
designing a Hydra LLM based on the multilingual model TowerInstruct, we
developed a state-of-the-art machine translation model for Tatar, in a
zero-shot manner, completely bypassing the need for high-quality parallel data.
This breakthrough is particularly significant for low-resource languages like
Tatar, where high-quality parallel data is hard to come by. By lowering the
data and time requirements for training high-quality models, our
trans-tokenization strategy allows for the development of LLMs for a wider
range of languages, especially those with limited resources. We hope that our
work will inspire further research and collaboration in the field of
cross-lingual vocabulary transfer and contribute to the empowerment of
languages on a global scale.
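
The core of the method described above is initializing each target-language token embedding as a weighted average of semantically similar source-language token embeddings. Below is a minimal sketch of that initialization step in PyTorch; the function name, the alignment format, and the random fallback initialization are illustrative assumptions rather than the authors' implementation, and the token-to-token weights are assumed to have been extracted beforehand from a translation resource covering both languages.

```python
import torch


def init_target_embeddings(source_embeddings: torch.Tensor,
                           alignment: dict[int, list[tuple[int, float]]],
                           target_vocab_size: int) -> torch.Tensor:
    """Initialize target-language token embeddings as weighted averages of
    semantically similar source-language token embeddings.

    `alignment` maps a target token id to (source token id, weight) pairs,
    assumed to be extracted in advance from a translation resource.
    """
    dim = source_embeddings.size(1)
    target = torch.empty(target_vocab_size, dim)
    # Fallback for target tokens with no mapping: small random init (an assumption).
    torch.nn.init.normal_(target, std=0.02)

    for tgt_id, pairs in alignment.items():
        src_ids = [src_id for src_id, _ in pairs]
        weights = torch.tensor([w for _, w in pairs])
        weights = weights / weights.sum()  # normalize weights per target token
        target[tgt_id] = (weights.unsqueeze(1) * source_embeddings[src_ids]).sum(0)

    return target


# Toy usage: 5 source tokens, 3 target tokens, embedding dim 4.
src = torch.randn(5, 4)
mapping = {0: [(1, 0.7), (3, 0.3)],  # target token 0 ~ source tokens 1 and 3
           2: [(4, 1.0)]}            # target token 2 ~ source token 4
tgt = init_target_embeddings(src, mapping, target_vocab_size=3)
```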
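The abstract also describes Hydra LLMs: models with multiple swappable language modeling heads and embedding tables on top of a shared transformer body. The sketch below shows what such a wrapper could look like; the class name `HydraLM`, its interface, and the single `lang` switch are hypothetical simplifications, and the shared body is assumed to be any module mapping embedded inputs to hidden states.

```python
import torch


class HydraLM(torch.nn.Module):
    """Sketch of a Hydra-style LM: one shared transformer body combined with
    per-language embedding tables and per-language output heads, any of which
    can be swapped in at run time."""

    def __init__(self, body: torch.nn.Module,
                 embeddings: dict[str, torch.nn.Embedding],
                 heads: dict[str, torch.nn.Linear]):
        super().__init__()
        self.body = body  # shared layers, e.g. a TowerInstruct-style stack
        self.embeddings = torch.nn.ModuleDict(embeddings)
        self.heads = torch.nn.ModuleDict(heads)

    def forward(self, input_ids: torch.Tensor, lang: str) -> torch.Tensor:
        # Embed with the active language's table, run the shared body,
        # and project with the matching language modeling head.
        hidden = self.body(self.embeddings[lang](input_ids))
        return self.heads[lang](hidden)  # logits over that language's vocabulary


# Toy usage with a stand-in body; real use would share a pretrained transformer.
dim = 16
model = HydraLM(torch.nn.Identity(),
                embeddings={"en": torch.nn.Embedding(100, dim),
                            "tt": torch.nn.Embedding(120, dim)},
                heads={"en": torch.nn.Linear(dim, 100),
                       "tt": torch.nn.Linear(dim, 120)})
logits = model(torch.tensor([[1, 2, 3]]), lang="tt")  # swap languages per call
```

Keeping the body fixed while swapping embeddings and heads is what allows the zero-shot transfer the abstract claims: the expensive shared layers are trained once, and only the lightweight language-specific components change per language.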