
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

April 9, 2024
Authors: Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, Xinrong Zhang, Zheng Leng Thai, Kaihuo Zhang, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, Dahai Li, Zhiyuan Liu, Maosong Sun
cs.AI

Abstract

The burgeoning interest in developing Large Language Models (LLMs) with up to a trillion parameters has been met with concerns regarding resource efficiency and practical expense, particularly given the immense cost of experimentation. This scenario underscores the importance of exploring the potential of Small Language Models (SLMs) as a resource-efficient alternative. In this context, we introduce MiniCPM, specifically the 1.2B and 2.4B non-embedding parameter variants, which not only excel in their respective categories but also demonstrate capabilities on par with 7B-13B LLMs. While focusing on SLMs, our approach exhibits scalability in both model and data dimensions for future LLM research. Regarding model scaling, we employ extensive model wind tunnel experiments for stable and optimal scaling. For data scaling, we introduce a Warmup-Stable-Decay (WSD) learning rate scheduler (LRS), conducive to continuous training and domain adaptation. We present an in-depth analysis of the intriguing training dynamics that occur under the WSD LRS. With the WSD LRS, we are able to efficiently study the data-model scaling law without extensive retraining experiments along both the model and data axes, from which we derive a much higher compute-optimal data-model ratio than Chinchilla Optimal. Additionally, we introduce the MiniCPM family, including MiniCPM-DPO, MiniCPM-MoE, and MiniCPM-128K, whose excellent performance further cements MiniCPM's foundation in diverse SLM applications. MiniCPM models are publicly available at https://github.com/OpenBMB/MiniCPM.
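The Warmup-Stable-Decay (WSD) scheduler mentioned in the abstract can be pictured as a three-phase, piecewise schedule: a short warmup, a long constant "stable" phase from which checkpoints can be branched off for continued training or domain adaptation, and a brief decay at the end. The Python sketch below is a minimal illustration under that reading; the function name `wsd_lr`, the default hyperparameters, and the exponential decay shape are assumptions made for the example, not the paper's exact configuration.

```python
# Minimal sketch of a Warmup-Stable-Decay (WSD) learning-rate schedule.
# Hyperparameter defaults and the exponential decay form are illustrative
# assumptions, not the settings used for MiniCPM.

def wsd_lr(step: int,
           total_steps: int,
           peak_lr: float = 1e-2,
           warmup_steps: int = 2000,
           decay_fraction: float = 0.1,
           final_lr_ratio: float = 0.1) -> float:
    """Return the learning rate for a given training step."""
    decay_start = int(total_steps * (1.0 - decay_fraction))
    if step < warmup_steps:
        # Warmup phase: linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(warmup_steps, 1)
    if step < decay_start:
        # Stable phase: hold the peak learning rate constant. Checkpoints
        # taken here can be resumed for continued training or domain
        # adaptation without restarting the schedule.
        return peak_lr
    # Decay phase: anneal over the final `decay_fraction` of training
    # (an exponential anneal toward final_lr_ratio * peak_lr is shown as
    # one plausible choice of decay shape).
    progress = (step - decay_start) / max(total_steps - decay_start, 1)
    return peak_lr * (final_lr_ratio ** progress)


if __name__ == "__main__":
    # Quick check of the three phases for a hypothetical 100k-step run.
    for s in (0, 1000, 2000, 50_000, 90_000, 95_000, 100_000):
        print(s, round(wsd_lr(s, 100_000), 6))
```

The design point this sketch tries to capture is that the stable phase carries most of the training budget at a constant learning rate, so scaling-law measurements at different data budgets can reuse a single stable-phase run and only pay for short decay segments, rather than retraining from scratch for every data size.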
