LLäMmlein: Compact and Competitive German-Only Language Models from Scratch

Abstract

We create two German-only decoder models, LLäMmlein 120M and 1B, transparently from scratch and publish them, along with the training data, for the German NLP research community to use. The model training involved several key steps, including extensive data preprocessing, the creation of a custom German tokenizer, the training itself, and the evaluation of the final models on various benchmarks. Throughout the training process, multiple checkpoints were saved and analyzed using the SuperGLEBer benchmark to monitor the models' learning dynamics. Compared to state-of-the-art models on the SuperGLEBer benchmark, both LLäMmlein models performed competitively, consistently matching or surpassing models with similar parameter sizes. The results show that the models' quality scales with size as expected, but performance improvements on some tasks plateaued early, offering valuable insights into resource allocation for future model development.
