MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
April 9, 2024
Authors: Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, Xinrong Zhang, Zheng Leng Thai, Kaihuo Zhang, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, Dahai Li, Zhiyuan Liu, Maosong Sun
cs.AI
Abstract
The burgeoning interest in developing Large Language Models (LLMs) with up to
a trillion parameters has been met with concerns regarding resource efficiency
and practical expense, particularly given the immense cost of experimentation.
This scenario underscores the importance of exploring the potential of Small
Language Models (SLMs) as a resource-efficient alternative. In this context, we
introduce MiniCPM, specifically its 1.2B and 2.4B non-embedding-parameter
variants, which not only excel in their respective categories but also demonstrate
capabilities on par with 7B-13B LLMs. While focusing on SLMs, our approach
exhibits scalability in both model and data dimensions for future LLM research.
Regarding model scaling, we employ extensive model wind tunnel experiments for
stable and optimal scaling. For data scaling, we introduce a
Warmup-Stable-Decay (WSD) learning rate scheduler (LRS), conducive to
continuous training and domain adaptation. We present an in-depth analysis of
the intriguing training dynamics that arise under the WSD LRS. With the WSD LRS, we
can now efficiently study the data-model scaling law without extensive
retraining experiments along both the model and data axes, from which we derive a
much higher compute-optimal data-model ratio than Chinchilla Optimal.
Additionally, we introduce the MiniCPM family, including MiniCPM-DPO, MiniCPM-MoE,
and MiniCPM-128K, whose excellent performance further cements MiniCPM's
foundation in diverse SLM applications. MiniCPM models are publicly available
at https://github.com/OpenBMB/MiniCPM.
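To give a concrete picture of the Warmup-Stable-Decay (WSD) learning rate scheduler mentioned in the abstract, below is a minimal, illustrative Python sketch. It assumes a linear warmup, a constant stable phase, and an exponential decay; the function name wsd_lr, its parameters (warmup_steps, stable_steps, decay_steps, min_lr), and the decay constant 5.0 are illustrative assumptions, not values taken from the paper, whose exact decay form and hyperparameters may differ.

```python
import math

def wsd_lr(step: int, max_lr: float, warmup_steps: int,
           stable_steps: int, decay_steps: int, min_lr: float = 0.0) -> float:
    """Illustrative Warmup-Stable-Decay (WSD) learning-rate schedule.

    Three phases:
      1) linear warmup from ~0 to max_lr,
      2) a long constant ("stable") phase at max_lr,
      3) a short decay that drops toward min_lr.
    """
    if step < warmup_steps:
        # Phase 1: linear warmup
        return max_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        # Phase 2: stable (constant) phase
        return max_lr
    # Phase 3: exponential decay toward min_lr (approaches min_lr at t = 1)
    t = min(1.0, (step - warmup_steps - stable_steps) / decay_steps)
    return min_lr + (max_lr - min_lr) * math.exp(-5.0 * t)
```

The property that makes this schedule conducive to continual training and domain adaptation, as described in the abstract, is the long constant phase: a checkpoint taken during the stable phase can be resumed later and decayed at any chosen point (including on new or domain-specific data), rather than committing to a fixed total step count up front as a cosine schedule does.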