Adapting LLMs to Hebrew: Unveiling DictaLM 2.0 with Enhanced Vocabulary and Instruction Capabilities
July 9, 2024
Authors: Shaltiel Shmidman, Avi Shmidman, Amir DN Cohen, Moshe Koppel
cs.AI
Abstract
Training large language models (LLMs) in low-resource languages such as
Hebrew poses unique challenges. In this paper, we introduce DictaLM2.0 and
DictaLM2.0-Instruct, two LLMs derived from the Mistral model, trained on a
substantial corpus of approximately 200 billion tokens in both Hebrew and
English. Adapting a pre-trained model to a new language involves specialized
techniques that differ significantly from training a model from scratch or
further training existing models on well-resourced languages such as English.
We outline these novel training methodologies, which facilitate effective
learning and adaptation to the linguistic properties of Hebrew. Additionally,
we fine-tuned DictaLM2.0-Instruct on a comprehensive instruct dataset to
enhance its performance on task-specific instructions. To rigorously evaluate
our models, we introduce a new benchmark suite for Hebrew LLM evaluation,
covering a diverse set of tasks including Question Answering, Sentiment
Analysis, Winograd Schema Challenge, Translation, and Summarization. Our work
not only addresses the intricacies of training LLMs in low-resource languages
but also proposes a framework that can be leveraged for adapting other LLMs to
various non-English languages, contributing to the broader field of
multilingual NLP.
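The "enhanced vocabulary" in the title refers to extending the base tokenizer with Hebrew tokens before continued pretraining. As a rough illustration of that general pattern, here is a minimal sketch using the Hugging Face `transformers` API; the token list and the embedding-initialization strategy are illustrative assumptions, not the paper's actual procedure.

```python
# Minimal sketch: extending a Mistral tokenizer with Hebrew tokens and
# resizing the embedding matrix before continued pretraining.
# Assumptions (not from the paper): the specific tokens added and how the
# new embedding rows are initialized.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "mistralai/Mistral-7B-v0.1"  # public Mistral checkpoint the paper builds on
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Toy Hebrew subwords; in practice these would come from a tokenizer
# trained on a large Hebrew corpus.
hebrew_tokens = ["שלום", "ירושלים", "מודל"]
num_added = tokenizer.add_tokens(hebrew_tokens)

# Grow the input (and tied output) embeddings so the new ids get trainable
# rows, which are then learned during continued pretraining on the
# Hebrew/English mix.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```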
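For the instruction-tuned variant, a typical chat-style invocation would look like the sketch below. The hub id `dicta-il/dictalm2.0-instruct` and the chat-template usage are assumptions based on common `transformers` conventions, not details stated in the abstract.

```python
# Hypothetical usage sketch; the hub id below is assumed, not given in the abstract.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dicta-il/dictalm2.0-instruct"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# A Hebrew instruction ("Translate to English: the weather is pleasant today."),
# matching the translation task in the paper's benchmark suite.
messages = [{"role": "user", "content": "תרגם לאנגלית: מזג האוויר נעים היום."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(input_ids, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```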