Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation
April 23, 2025
作者: Luca Moroni, Giovanni Puccetti, Pere-Lluis Huguet Cabot, Andrei Stefan Bejgu, Edoardo Barba, Alessio Miaschi, Felice Dell'Orletta, Andrea Esuli, Roberto Navigli
cs.AI
Abstract
The number of pretrained Large Language Models (LLMs) is increasing steadily,
though the majority are designed predominantly for the English language. While
state-of-the-art LLMs can handle other languages, due to language contamination
or some degree of multilingual pretraining data, they are not optimized for
non-English languages, leading to inefficient encoding (high token "fertility")
and slower inference speed. In this work, we thoroughly compare a variety of
vocabulary adaptation techniques for optimizing English LLMs for the Italian
language, and put forward Semantic Alignment Vocabulary Adaptation (SAVA), a
novel method that leverages neural mapping for vocabulary substitution. SAVA
achieves competitive performance across multiple downstream tasks, enhancing
grounded alignment strategies. We adapt two LLMs: Mistral-7b-v0.1, reducing
token fertility by 25%, and Llama-3.1-8B, optimizing the vocabulary and
reducing the number of parameters by 1 billion. We show that, following the
adaptation of the vocabulary, these models can recover their performance with a
relatively limited stage of continual training on the target language. Finally,
we test the capabilities of the adapted models on various multiple-choice and
generative tasks.
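
The token "fertility" the abstract refers to is the average number of subword tokens a tokenizer produces per word; a lower value means more efficient encoding and faster inference for the target language. A minimal sketch of how one might measure it is below; the tokenizer checkpoint (Mistral-7b-v0.1, one of the two models adapted in the paper) is real, while the sample sentence and function name are illustrative.

```python
# Minimal sketch: measure token fertility (subword tokens per
# whitespace-delimited word) of a tokenizer on target-language text.
from transformers import AutoTokenizer

def token_fertility(tokenizer, texts):
    """Return total subword tokens divided by total words over `texts`."""
    n_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

# Illustrative Italian sample; a real measurement would use a large corpus.
italian_sample = [
    "Il gatto dorme tranquillamente sul davanzale della finestra.",
]

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
print(f"fertility: {token_fertility(tok, italian_sample):.2f}")
```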
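The abstract describes SAVA only as "leveraging neural mapping for vocabulary substitution". The sketch below illustrates the general semantic-alignment idea behind such methods: fit a linear map between a helper embedding space (which covers the target vocabulary) and the source LLM's embedding space using tokens shared by both vocabularies, then project the new tokens' embeddings through that map to initialize them. This is a generic reconstruction under stated assumptions, not the paper's exact SAVA procedure; all names and shapes are illustrative.

```python
# Hedged sketch of semantic-alignment vocabulary adaptation:
# learn a linear map on shared anchor tokens, then use it to
# initialize embeddings for new target-language tokens.
import numpy as np

def fit_alignment(helper_emb, source_emb, shared_pairs):
    """Least-squares map W from helper space to source space.

    shared_pairs: list of (helper_idx, source_idx) for tokens present
    in both vocabularies, used as anchor points.
    """
    H = np.stack([helper_emb[i] for i, _ in shared_pairs])  # (n, d_h)
    S = np.stack([source_emb[j] for _, j in shared_pairs])  # (n, d_s)
    W, *_ = np.linalg.lstsq(H, S, rcond=None)               # (d_h, d_s)
    return W

def init_new_embeddings(helper_emb, W, new_token_ids):
    """Project helper embeddings of new tokens into the source space."""
    return np.stack([helper_emb[i] for i in new_token_ids]) @ W

# Toy shapes: 1000-token helper vocab (d=64), 800-token source vocab (d=96).
rng = np.random.default_rng(0)
helper_emb = rng.normal(size=(1000, 64))
source_emb = rng.normal(size=(800, 96))
shared = [(i, i) for i in range(500)]  # pretend the first 500 ids coincide
W = fit_alignment(helper_emb, source_emb, shared)
new_rows = init_new_embeddings(helper_emb, W, range(500, 1000))
print(new_rows.shape)  # (500, 96)
```

After such an initialization, the adapted model would still need the "relatively limited stage of continual training on the target language" the abstract mentions to recover its performance.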