Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation

Abstract

The number of pretrained Large Language Models (LLMs) is increasing steadily, though the majority are designed predominantly for the English language. While state-of-the-art LLMs can handle other languages, due to language contamination or some degree of multilingual pretraining data, they are not optimized for non-English languages, leading to inefficient encoding (high token "fertility") and slower inference speed. In this work, we thoroughly compare a variety of vocabulary adaptation techniques for optimizing English LLMs for the Italian language, and propose Semantic Alignment Vocabulary Adaptation (SAVA), a novel method that leverages neural mapping for vocabulary substitution. SAVA achieves competitive performance across multiple downstream tasks, enhancing grounded alignment strategies. We adapt two LLMs: Mistral-7b-v0.1, reducing token fertility by 25%, and Llama-3.1-8B, optimizing the vocabulary and reducing the number of parameters by 1 billion. We show that, following the adaptation of the vocabulary, these models can recover their performance with a relatively limited stage of continual training on the target language. Finally, we test the capabilities of the adapted models on various multiple-choice and generative tasks.
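
To make the notion of token fertility concrete, the following is a minimal sketch that assumes fertility is measured as the average number of tokens per whitespace-delimited word; the abstract does not state the exact definition used. The greedy tokenizer and both vocabularies are hypothetical toy stand-ins for illustration only, not the Mistral or Llama tokenizers.

```python
from typing import Callable, List, Set


def token_fertility(texts: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word (assumed definition)."""
    n_words = sum(len(t.split()) for t in texts)
    n_tokens = sum(len(tokenize(t)) for t in texts)
    if n_words == 0:
        raise ValueError("texts contain no words")
    return n_tokens / n_words


def make_greedy_tokenizer(vocab: Set[str]) -> Callable[[str], List[str]]:
    """Toy longest-match tokenizer (hypothetical); unknown spans fall back to single characters."""
    max_len = max(len(v) for v in vocab)

    def tokenize(text: str) -> List[str]:
        tokens: List[str] = []
        for word in text.split():
            i = 0
            while i < len(word):
                # Try the longest candidate first; a single character always matches.
                for j in range(min(len(word), i + max_len), i, -1):
                    if word[i:j] in vocab or j == i + 1:
                        tokens.append(word[i:j])
                        i = j
                        break
        return tokens

    return tokenize


# Hypothetical vocabularies: one with mostly English pieces, one adapted to Italian.
source_tok = make_greedy_tokenizer({"the", "ing", "tion", "la", "ca"})
target_tok = make_greedy_tokenizer({"la", "casa", "bella", "è"})

texts = ["la casa è bella"]
f_source = token_fertility(texts, source_tok)  # 9 tokens / 4 words
f_target = token_fertility(texts, target_tok)  # 4 tokens / 4 words
relative_reduction = (f_source - f_target) / f_source

print(f"source fertility: {f_source:.2f}")
print(f"target fertility: {f_target:.2f}")
print(f"relative reduction: {relative_reduction:.0%}")
```

Lower fertility means fewer tokens for the same text, which shortens sequences at both training and inference time. In practice the same `token_fertility` function could be called with a real tokenizer's encode method on a held-out Italian corpus; the toy numbers here say nothing about the reductions reported in the abstract.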
