Steel-LLM: From Scratch to Open Source -- A Personal Journey in Building a Chinese-Centric LLM

Abstract

Steel-LLM is a Chinese-centric language model developed from scratch with the goal of creating a high-quality, open-source model despite limited computational resources. Launched in March 2024, the project aimed to train a 1-billion-parameter model on a large-scale dataset, prioritizing transparency and the sharing of practical insights to help others in the community. Training focused primarily on Chinese data, with a small proportion of English data included. The project addresses a gap in existing open-source LLMs by providing a more detailed and practical account of the model-building process. Steel-LLM has demonstrated competitive performance on benchmarks such as CEVAL and CMMLU, outperforming early models from larger institutions. This paper summarizes the project's key contributions, including data collection, model design, training methodology, and the challenges encountered along the way, as a resource for researchers and practitioners looking to develop their own LLMs. The model checkpoints and training script are available at https://github.com/zhanshijinwat/Steel-LLM.
