Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP

Abstract

The development of monolingual language models for low- and mid-resource languages continues to be hindered by the difficulty of sourcing high-quality training data. In this study, we present a novel cross-lingual vocabulary transfer strategy, trans-tokenization, designed to tackle this challenge and enable more efficient language adaptation. Our approach adapts a high-resource monolingual LLM to an unseen target language by initializing each target-language token embedding as a weighted average of the embeddings of semantically similar source-language tokens. To do so, we leverage a translation resource covering both the source and target languages. We validate our method with the Tweeties, a series of trans-tokenized LLMs, and demonstrate their competitive performance on various downstream tasks across a small but diverse set of languages. Additionally, we introduce Hydra LLMs, models with multiple swappable language modeling heads and embedding tables, which further extend the capabilities of our trans-tokenization strategy. By designing a Hydra LLM based on the multilingual model TowerInstruct, we developed a state-of-the-art machine translation model for Tatar in a zero-shot manner, completely bypassing the need for high-quality parallel data. This breakthrough is particularly significant for low-resource languages like Tatar, for which high-quality parallel data is hard to come by. By lowering the data and time requirements for training high-quality models, our trans-tokenization strategy enables the development of LLMs for a wider range of languages, especially those with limited resources. We hope that our work will inspire further research and collaboration in the field of cross-lingual vocabulary transfer and contribute to the empowerment of languages on a global scale.
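The embedding initialization described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `init_target_embeddings`, the dictionary-based data layout, and the fallback for unaligned tokens are all assumptions made for illustration. The sketch also assumes the alignment weights have already been derived from a translation resource; the abstract does not say how they are computed.

```python
from typing import Dict, List


def init_target_embeddings(
    source_embeddings: Dict[str, List[float]],
    alignment_weights: Dict[str, Dict[str, float]],
) -> Dict[str, List[float]]:
    """Initialize each target token as a weighted average of source embeddings.

    source_embeddings: source token -> embedding vector (all the same length).
    alignment_weights: target token -> {source token: non-negative weight}.
        Hypothetical input; assumed to come from a translation resource.
    """
    dim = len(next(iter(source_embeddings.values())))
    n_source = len(source_embeddings)

    # Assumed fallback for target tokens with no usable alignment:
    # the mean of all source embeddings.
    mean_vector = [
        sum(vec[i] for vec in source_embeddings.values()) / n_source
        for i in range(dim)
    ]

    target_embeddings: Dict[str, List[float]] = {}
    for target_token, weights in alignment_weights.items():
        # Keep only positively weighted source tokens that have an embedding.
        usable = {
            src: w
            for src, w in weights.items()
            if w > 0 and src in source_embeddings
        }
        total = sum(usable.values())
        if total == 0:
            target_embeddings[target_token] = list(mean_vector)
            continue
        # Weighted average, with weights normalized to sum to 1.
        target_embeddings[target_token] = [
            sum(w * source_embeddings[src][i] for src, w in usable.items()) / total
            for i in range(dim)
        ]
    return target_embeddings


# Toy example with made-up tokens and 2-dimensional embeddings.
source = {"cat": [1.0, 0.0], "kitten": [0.0, 1.0], "dog": [1.0, 1.0]}
alignment = {"kat": {"cat": 3.0, "kitten": 1.0}, "hond": {"dog": 2.0}}
print(init_target_embeddings(source, alignment))
```

In this toy setup, a target token aligned mostly to one source token ends up close to that token's embedding, while the remaining weight pulls it toward related tokens. A real model would apply the same averaging to rows of the embedding matrix, and possibly to the language modeling head, rather than to Python dictionaries.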
