
CroissantLLM: A Truly Bilingual French-English Language Model

February 1, 2024
作者: Manuel Faysse, Patrick Fernandes, Nuno Guerreiro, António Loison, Duarte Alves, Caio Corro, Nicolas Boizard, João Alves, Ricardo Rei, Pedro Martins, Antoni Bigata Casademunt, François Yvon, André Martins, Gautier Viaud, Céline Hudelot, Pierre Colombo
cs.AI

Abstract

We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens, to bring to the research and industrial community a high-performance, fully open-sourced bilingual model that runs swiftly on consumer-grade local hardware. To that end, we pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio, a custom tokenizer, and bilingual finetuning datasets. We release the training dataset, notably containing a French split with manually curated, high-quality, and varied data sources. To assess performance outside of English, we craft a novel benchmark, FrenchBench, consisting of an array of classification and generation tasks covering various orthogonal aspects of model performance in the French language. Additionally, rooted in transparency and to foster further Large Language Model research, we release codebases and dozens of checkpoints across various model sizes, training data distributions, and training steps, as well as fine-tuned Chat models and strong translation models. We evaluate our model through the FMTI framework, and validate 81% of the transparency criteria, far beyond the scores of even most open initiatives. This work enriches the NLP landscape, breaking away from previous English-centric work in order to strengthen our understanding of multilinguality in language models.
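The 1:1 English-to-French pretraining data ratio described above can be illustrated with a minimal token-budget sampler. This is a hedged sketch with hypothetical names (`sample_bilingual_batch`, `en_docs`, `fr_docs`); the actual data pipeline is in the released codebase and will differ in detail:

```python
import random


def sample_bilingual_batch(en_docs, fr_docs, n_docs, seed=0):
    """Draw a batch in which English and French each contribute roughly
    half of the documents, mirroring a 1:1 pretraining data ratio.
    Hypothetical helper for illustration, not the paper's pipeline."""
    rng = random.Random(seed)
    half = n_docs // 2
    # Sample without replacement from each language pool, then shuffle
    # so the two languages are interleaved within the batch.
    batch = rng.sample(en_docs, half) + rng.sample(fr_docs, n_docs - half)
    rng.shuffle(batch)
    return batch


en_docs = [("en", i) for i in range(100)]
fr_docs = [("fr", i) for i in range(100)]
batch = sample_bilingual_batch(en_docs, fr_docs, 8)
```

In a real pipeline the balance would typically be enforced on token counts rather than document counts, since document lengths differ across languages.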