BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline
August 27, 2024
Authors: Guosheng Dong, Da Pan, Yiding Sun, Shusen Zhang, Zheng Liang, Xin Wu, Yanjun Shen, Fan Yang, Haoze Sun, Tianpeng Li, Mingan Lin, Jianhua Xu, Yufan Zhang, Xiaonan Nie, Lei Su, Bingning Wang, Wentao Zhang, Jiaxin Mao, Zenan Zhou, Weipeng Chen
cs.AI
Abstract
The general capabilities of Large Language Models (LLMs) rely heavily on the composition and selection of extensive pretraining datasets, which several institutions treat as commercial secrets. To mitigate this issue, we open-source the details of a universally applicable data processing pipeline and validate its effectiveness and potential by introducing a competitive LLM baseline. Specifically, the data processing pipeline consists of broad collection to scale up and reweighting to improve quality. Using data processed by this pipeline, we pretrain a 7B model, BaichuanSEED, on 3T tokens without any deliberate optimization for downstream tasks, followed by a simple but effective supervised fine-tuning stage. BaichuanSEED demonstrates consistency and predictability throughout training and achieves performance on comprehensive benchmarks comparable to several advanced commercial large language models, such as Qwen1.5 and Llama3. We also conduct several heuristic experiments to discuss the potential for further optimizing downstream tasks such as mathematics and coding.