
Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation

April 23, 2025
作者: Luca Moroni, Giovanni Puccetti, Pere-Lluis Huguet Cabot, Andrei Stefan Bejgu, Edoardo Barba, Alessio Miaschi, Felice Dell'Orletta, Andrea Esuli, Roberto Navigli
cs.AI

Abstract

The number of pretrained Large Language Models (LLMs) is increasing steadily, though the majority are designed predominantly for the English language. While state-of-the-art LLMs can handle other languages, due to language contamination or some degree of multilingual pretraining data, they are not optimized for non-English languages, leading to inefficient encoding (high token "fertility") and slower inference speed. In this work, we thoroughly compare a variety of vocabulary adaptation techniques for optimizing English LLMs for the Italian language, and put forward Semantic Alignment Vocabulary Adaptation (SAVA), a novel method that leverages neural mapping for vocabulary substitution. SAVA achieves competitive performance across multiple downstream tasks, enhancing grounded alignment strategies. We adapt two LLMs: Mistral-7b-v0.1, reducing token fertility by 25%, and Llama-3.1-8B, optimizing the vocabulary and reducing the number of parameters by 1 billion. We show that, following the adaptation of the vocabulary, these models can recover their performance with a relatively limited stage of continual training on the target language. Finally, we test the capabilities of the adapted models on various multiple-choice and generative tasks.
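Token "fertility" is the average number of subword tokens a tokenizer produces per word, so a higher value means less efficient encoding for that language. The following is a minimal sketch of how fertility is measured; the `toy_tokenize` function is a hypothetical stand-in for a real subword tokenizer (in practice one would use the model's actual BPE tokenizer), not the paper's implementation.

```python
def toy_tokenize(word: str) -> list[str]:
    """Hypothetical tokenizer: splits a word into 2-character chunks.
    A stand-in for a real BPE/SentencePiece tokenizer, used only to
    illustrate the metric."""
    return [word[i:i + 2] for i in range(0, len(word), 2)]

def fertility(text: str) -> float:
    """Token fertility: average tokens produced per whitespace word."""
    words = text.split()
    tokens = [t for w in words for t in toy_tokenize(w)]
    return len(tokens) / len(words)

# Under an English-centric vocabulary, Italian text tends to split into
# more pieces per word, inflating fertility and slowing inference.
print(fertility("il gatto dorme"))  # 7 tokens / 3 words ≈ 2.33
```

A 25% fertility reduction, as reported for the adapted Mistral-7b-v0.1, means the same Italian text is encoded in roughly a quarter fewer tokens, directly shortening sequences at both training and inference time.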

