CroissantLLM: A Truly Bilingual French-English Language Model

Abstract

We introduce CroissantLLM, a 1.3B-parameter language model pretrained on a set of 3T English and French tokens, to bring to the research and industrial community a high-performance, fully open-sourced bilingual model that runs swiftly on consumer-grade local hardware. To that end, we pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio, a custom tokenizer, and bilingual finetuning datasets. We release the training dataset, notably containing a French split with manually curated, high-quality, and varied data sources. To assess performance outside of English, we craft a novel benchmark, FrenchBench, consisting of an array of classification and generation tasks covering various orthogonal aspects of model performance in the French language. Additionally, rooted in transparency and to foster further Large Language Model research, we release codebases and dozens of checkpoints across various model sizes, training data distributions, and training steps, as well as fine-tuned Chat models and strong translation models. We evaluate our model through the FMTI framework and validate 81% of the transparency criteria, far beyond the scores of even most open initiatives. This work enriches the NLP landscape, breaking away from previous English-centric work in order to strengthen our understanding of multilinguality in language models.
