
Model Merging in Pre-training of Large Language Models

May 17, 2025
作者: Yunshui Li, Yiyuan Ma, Shen Yan, Chaoyi Zhang, Jing Liu, Jianqiao Lu, Ziwen Xu, Mengzhao Chen, Minrui Wang, Shiyi Zhan, Jin Ma, Xunhao Lai, Yao Luo, Xingyan Bin, Hongbin Ren, Mingji Han, Wenhao Hao, Bairen Yi, LingJun Liu, Bole Ma, Xiaoying Jia, Zhou Xun, Liang Xiang, Yonghui Wu
cs.AI

Abstract

Model merging has emerged as a promising technique for enhancing large language models, though its application in large-scale pre-training remains relatively unexplored. In this paper, we present a comprehensive investigation of model merging techniques during the pre-training process. Through extensive experiments with both dense and Mixture-of-Experts (MoE) architectures ranging from millions to over 100 billion parameters, we demonstrate that merging checkpoints trained with constant learning rates not only achieves significant performance improvements but also enables accurate prediction of annealing behavior. These improvements lead to both more efficient model development and significantly lower training costs. Our detailed ablation studies on merging strategies and hyperparameters provide new insights into the underlying mechanisms while uncovering novel applications. Through comprehensive experimental analysis, we offer the open-source community practical pre-training guidelines for effective model merging.
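The abstract does not spell out the merging recipe itself; as a rough illustration of the general idea (averaging the weights of checkpoints saved during constant-learning-rate training), a minimal sketch might look like the code below. The uniform weights, PyTorch state-dict format, and checkpoint paths are illustrative assumptions, not the paper's actual method.

```python
# Minimal sketch of checkpoint merging via weight averaging (assumption:
# the paper's exact merging strategy and per-checkpoint weights are not
# given in the abstract; uniform averaging is used here for illustration).
import torch


def merge_checkpoints(state_dicts, weights=None):
    """Return a weighted average of a list of model state dicts."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged


# Hypothetical usage: merge the last few checkpoints from a constant-LR run.
# paths = ["ckpt_10000.pt", "ckpt_11000.pt", "ckpt_12000.pt"]  # hypothetical paths
# state_dicts = [torch.load(p, map_location="cpu") for p in paths]
# torch.save(merge_checkpoints(state_dicts), "merged.pt")
```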

