Adapting LLMs to Hebrew: Unveiling DictaLM 2.0 with Enhanced Vocabulary and Instruction Capabilities
July 9, 2024
Authors: Shaltiel Shmidman, Avi Shmidman, Amir DN Cohen, Moshe Koppel
cs.AI
Abstract
Training large language models (LLMs) in low-resource languages such as
Hebrew poses unique challenges. In this paper, we introduce DictaLM2.0 and
DictaLM2.0-Instruct, two LLMs derived from the Mistral model, trained on a
substantial corpus of approximately 200 billion tokens in both Hebrew and
English. Adapting a pre-trained model to a new language involves specialized
techniques that differ significantly from training a model from scratch or
further training existing models on well-resourced languages such as English.
We outline these novel training methodologies, which facilitate effective
learning and adaptation to the linguistic properties of Hebrew. Additionally,
we fine-tuned DictaLM2.0-Instruct on a comprehensive instruct dataset to
enhance its performance on task-specific instructions. To rigorously evaluate
our models, we introduce a new benchmark suite for Hebrew LLM evaluation,
covering a diverse set of tasks including Question Answering, Sentiment
Analysis, Winograd Schema Challenge, Translation, and Summarization. Our work
not only addresses the intricacies of training LLMs in low-resource languages
but also proposes a framework that can be leveraged for adapting other LLMs to
various non-English languages, contributing to the broader field of
multilingual NLP.
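To make the vocabulary-adaptation step the abstract alludes to more concrete, here is a minimal sketch of how a pretrained model's tokenizer and embedding matrix can be extended with target-language tokens before continued pretraining, using the Hugging Face transformers API. This is an illustration of the general technique, not the authors' exact recipe; the Hebrew token list and the random initialization of new embedding rows are assumptions for the example.

```python
# Sketch: extend a pretrained tokenizer with target-language tokens and
# resize the embedding matrix so continued pretraining can learn them.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Base model named in the abstract; adaptation starts from its checkpoint.
base_model_id = "mistralai/Mistral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(base_model_id)

# Hypothetical: Hebrew subword tokens mined from the target-language corpus.
hebrew_tokens = ["שלום", "ירושלים", "ללמוד"]
num_added = tokenizer.add_tokens(hebrew_tokens)

# Grow input/output embeddings to cover the added tokens. New rows are
# randomly initialized here and would be trained during continued
# pretraining on the Hebrew/English corpus.
model.resize_token_embeddings(len(tokenizer))

print(f"Added {num_added} tokens; new vocab size: {len(tokenizer)}")
```

The motivation for this step is that a tokenizer trained mostly on English fragments Hebrew words into many pieces, inflating sequence lengths and degrading modeling quality; extending the vocabulary with corpus-derived Hebrew tokens mitigates this. In practice, new embedding rows are often initialized from statistics of existing embeddings rather than randomly; the exact initialization and training schedule used for DictaLM 2.0 are described in the paper itself.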