BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline

Abstract

The general capabilities of Large Language Models (LLMs) rely heavily on the composition and selection of extensive pretraining datasets, which several institutions treat as commercial secrets. To mitigate this issue, we open-source the details of a universally applicable data processing pipeline and validate its effectiveness and potential by introducing a competitive LLM baseline. Specifically, the data processing pipeline consists of broad collection to scale up the data and reweighting to improve its quality. We then pretrain a 7B model, BaichuanSEED, on 3T tokens processed by our pipeline without any deliberate downstream task-related optimization, followed by a simple but effective supervised fine-tuning stage. BaichuanSEED demonstrates consistency and predictability throughout training and achieves performance comparable to several advanced commercial large language models, such as Qwen1.5 and Llama3, on comprehensive benchmarks. We also conduct several heuristic experiments to discuss the potential for further optimization on downstream tasks such as mathematics and coding.
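The abstract names deduplication and reweighting as pipeline stages but does not specify how either is implemented. As a rough illustration only, the sketch below assumes exact deduplication by hashing normalized text and reweighting by sampling documents in proportion to per-source weights. The function names, the `source` and `text` fields, and the weight values are all hypothetical and are not taken from the paper.

```python
import hashlib
import random
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def deduplicate(docs):
    """Keep the first document seen for each distinct normalized-content hash."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def reweight_sample(docs, source_weights, k, seed=0):
    """Draw k documents with replacement, proportional to per-source weights.

    Sources missing from source_weights default to a weight of 1.0.
    """
    rng = random.Random(seed)
    weights = [source_weights.get(doc["source"], 1.0) for doc in docs]
    return rng.choices(docs, weights=weights, k=k)


# Toy corpus (hypothetical): the first two entries differ only in case and spacing.
corpus = [
    {"source": "web", "text": "Hello   world"},
    {"source": "web", "text": "hello world"},
    {"source": "code", "text": "print('hi')"},
    {"source": "books", "text": "A long chapter."},
]

unique_docs = deduplicate(corpus)
# Illustrative weights only: upweight code, downweight web text.
sample = reweight_sample(unique_docs, {"web": 0.5, "code": 2.0, "books": 1.0}, k=1000)
print(len(unique_docs), Counter(doc["source"] for doc in sample))
```

A production pipeline would more likely need near-duplicate detection and weights tuned on held-out evaluations. Exact hashing is used here only because it keeps the example short and deterministic.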
