Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation
April 23, 2025
作者: Luca Moroni, Giovanni Puccetti, Pere-Lluis Huguet Cabot, Andrei Stefan Bejgu, Edoardo Barba, Alessio Miaschi, Felice Dell'Orletta, Andrea Esuli, Roberto Navigli
cs.AI
Abstract
The number of pretrained Large Language Models (LLMs) is increasing steadily,
though the majority are designed predominantly for the English language. While
state-of-the-art LLMs can handle other languages, due to language contamination
or some degree of multilingual pretraining data, they are not optimized for
non-English languages, leading to inefficient encoding (high token "fertility")
and slower inference speed. In this work, we thoroughly compare a variety of
vocabulary adaptation techniques for optimizing English LLMs for the Italian
language, and put forward Semantic Alignment Vocabulary Adaptation (SAVA), a
novel method that leverages neural mapping for vocabulary substitution. SAVA
achieves competitive performance across multiple downstream tasks, enhancing
grounded alignment strategies. We adapt two LLMs: Mistral-7b-v0.1, reducing
token fertility by 25%, and Llama-3.1-8B, optimizing the vocabulary and
reducing the number of parameters by 1 billion. We show that, following the
adaptation of the vocabulary, these models can recover their performance with a
relatively limited stage of continual training on the target language. Finally,
we test the capabilities of the adapted models on various multiple-choice and
generative tasks.
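
The token "fertility" the abstract refers to is the average number of subword tokens a tokenizer produces per word; a lower value means more efficient encoding and faster inference for the target language. A minimal sketch of how one might measure it is below; the tokenizer checkpoint (Mistral-7b-v0.1, one of the two models adapted in the paper) is real, while the sample sentence and function name are illustrative.

```python
# Minimal sketch: measure token fertility (subword tokens per
# whitespace-delimited word) of a tokenizer on target-language text.
from transformers import AutoTokenizer

def token_fertility(tokenizer, texts):
    """Return total subword tokens divided by total words over `texts`."""
    n_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

# Illustrative Italian sample; a real measurement would use a large corpus.
italian_sample = [
    "Il gatto dorme tranquillamente sul davanzale della finestra.",
]

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
print(f"fertility: {token_fertility(tok, italian_sample):.2f}")
```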
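The abstract describes SAVA only as "leveraging neural mapping for vocabulary substitution". The sketch below illustrates the general semantic-alignment idea behind such methods: fit a linear map between a helper embedding space (which covers the target vocabulary) and the source LLM's embedding space using tokens shared by both vocabularies, then project the new tokens' embeddings through that map to initialize them. This is a generic reconstruction under stated assumptions, not the paper's exact SAVA procedure; all names and shapes are illustrative.

```python
# Hedged sketch of semantic-alignment vocabulary adaptation:
# learn a linear map on shared anchor tokens, then use it to
# initialize embeddings for new target-language tokens.
import numpy as np

def fit_alignment(helper_emb, source_emb, shared_pairs):
    """Least-squares map W from helper space to source space.

    shared_pairs: list of (helper_idx, source_idx) for tokens present
    in both vocabularies, used as anchor points.
    """
    H = np.stack([helper_emb[i] for i, _ in shared_pairs])  # (n, d_h)
    S = np.stack([source_emb[j] for _, j in shared_pairs])  # (n, d_s)
    W, *_ = np.linalg.lstsq(H, S, rcond=None)               # (d_h, d_s)
    return W

def init_new_embeddings(helper_emb, W, new_token_ids):
    """Project helper embeddings of new tokens into the source space."""
    return np.stack([helper_emb[i] for i in new_token_ids]) @ W

# Toy shapes: 1000-token helper vocab (d=64), 800-token source vocab (d=96).
rng = np.random.default_rng(0)
helper_emb = rng.normal(size=(1000, 64))
source_emb = rng.normal(size=(800, 96))
shared = [(i, i) for i in range(500)]  # pretend the first 500 ids coincide
W = fit_alignment(helper_emb, source_emb, shared)
new_rows = init_new_embeddings(helper_emb, W, range(500, 1000))
print(new_rows.shape)  # (500, 96)
```

After such an initialization, the adapted model would still need the "relatively limited stage of continual training on the target language" the abstract mentions to recover its performance.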