
CroissantLLM: A Truly Bilingual French-English Language Model

February 1, 2024
作者: Manuel Faysse, Patrick Fernandes, Nuno Guerreiro, António Loison, Duarte Alves, Caio Corro, Nicolas Boizard, João Alves, Ricardo Rei, Pedro Martins, Antoni Bigata Casademunt, François Yvon, André Martins, Gautier Viaud, Céline Hudelot, Pierre Colombo
cs.AI

Abstract

We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens, to bring to the research and industrial community a high-performance, fully open-sourced bilingual model that runs swiftly on consumer-grade local hardware. To that end, we pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio, a custom tokenizer, and bilingual finetuning datasets. We release the training dataset, notably containing a French split with manually curated, high-quality, and varied data sources. To assess performance outside of English, we craft a novel benchmark, FrenchBench, consisting of an array of classification and generation tasks, covering various orthogonal aspects of model performance in the French language. Additionally, rooted in transparency and to foster further Large Language Model research, we release codebases and dozens of checkpoints across various model sizes, training data distributions, and training steps, as well as fine-tuned Chat models and strong translation models. We evaluate our model through the FMTI framework and validate 81% of the transparency criteria, far beyond the scores of even most open initiatives. This work enriches the NLP landscape, breaking away from previous English-centric work in order to strengthen our understanding of multilinguality in language models.
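The 1:1 English-to-French pretraining data ratio described above can be pictured as a simple interleaving sampler over the two corpora. The sketch below is purely illustrative, not the paper's actual data pipeline; the function and stream names are hypothetical:

```python
import itertools

def mix_bilingual(en_stream, fr_stream):
    """Yield examples alternating English/French, giving a 1:1 mix.

    Illustrative sketch only: assumes both streams are example
    iterators; real pipelines typically balance by token count,
    not example count.
    """
    for en_ex, fr_ex in zip(en_stream, fr_stream):
        yield en_ex
        yield fr_ex

# Toy infinite "corpora" standing in for real document streams.
en = (f"en_{i}" for i in itertools.count())
fr = (f"fr_{i}" for i in itertools.count())

# Take the first 6 mixed examples: en_0, fr_0, en_1, fr_1, en_2, fr_2
mixed = list(itertools.islice(mix_bilingual(en, fr), 6))
```

Alternating at the example level keeps the two languages evenly represented throughout training rather than front-loading one language.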