Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP

Abstract

The development of monolingual language models for low- and mid-resource languages continues to be hindered by the difficulty of sourcing high-quality training data. In this study, we present a novel cross-lingual vocabulary transfer strategy, trans-tokenization, designed to tackle this challenge and enable more efficient language adaptation. Our approach adapts a high-resource monolingual LLM to an unseen target language by initializing the token embeddings of the target language as a weighted average of semantically similar token embeddings from the source language. For this, we leverage a translation resource covering both the source and target languages. We validate our method with the Tweeties, a series of trans-tokenized LLMs, and demonstrate their competitive performance on various downstream tasks across a small but diverse set of languages. Additionally, we introduce Hydra LLMs, models with multiple swappable language modeling heads and embedding tables, which further extend the capabilities of our trans-tokenization strategy. By designing a Hydra LLM based on the multilingual model TowerInstruct, we developed a state-of-the-art machine translation model for Tatar in a zero-shot manner, completely bypassing the need for high-quality parallel data. This breakthrough is particularly significant for low-resource languages like Tatar, where high-quality parallel data is hard to come by. By lowering the data and time requirements for training high-quality models, our trans-tokenization strategy allows for the development of LLMs for a wider range of languages, especially those with limited resources. We hope that our work will inspire further research and collaboration in the field of cross-lingual vocabulary transfer and contribute to the empowerment of languages on a global scale.
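
To make the embedding initialization described above concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the translation resource has already been reduced to a table of alignment weights between target and source tokens. The function name `init_target_embeddings`, the `alignment_counts` structure, the toy tokens, and the fallback to the mean source embedding for unaligned tokens are all illustrative assumptions rather than details taken from the abstract.

```python
from typing import Dict, List

Vector = List[float]


def init_target_embeddings(
    source_embeddings: Dict[str, Vector],
    alignment_counts: Dict[str, Dict[str, float]],
) -> Dict[str, Vector]:
    """Initialize each target token as a weighted average of source embeddings.

    alignment_counts[target_token][source_token] is a hypothetical non-negative
    weight, for example how often the two tokens were aligned in a translation
    resource. Weights are normalized per target token before averaging.
    """
    dim = len(next(iter(source_embeddings.values())))
    n = len(source_embeddings)
    # Assumption: target tokens without any usable alignment fall back to the
    # mean of all source embeddings.
    mean_vec = [sum(v[i] for v in source_embeddings.values()) / n for i in range(dim)]

    target_embeddings: Dict[str, Vector] = {}
    for tgt, weights in alignment_counts.items():
        # Keep only positive weights that point at known source tokens.
        known = {s: w for s, w in weights.items() if s in source_embeddings and w > 0}
        total = sum(known.values())
        if total == 0:
            target_embeddings[tgt] = list(mean_vec)
            continue
        vec = [0.0] * dim
        for src, w in known.items():
            for i, x in enumerate(source_embeddings[src]):
                vec[i] += (w / total) * x
        target_embeddings[tgt] = vec
    return target_embeddings


# Toy usage with made-up 2-dimensional embeddings and alignment weights.
toy_source = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "pet": [1.0, 1.0]}
toy_alignments = {"kat": {"cat": 3, "pet": 1}, "hond": {"dog": 1}, "xyz": {}}
toy_target = init_target_embeddings(toy_source, toy_alignments)
print(toy_target["kat"])  # [1.0, 0.25]
```

In this toy setup, "kat" is aligned three times to "cat" and once to "pet", so its initial vector is 0.75 times the "cat" embedding plus 0.25 times the "pet" embedding. In a real model the same averaging would be applied to the rows of the embedding table (and, where applicable, the language modeling head) before further training on target-language text.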
