Late-to-Early Training: LET LLMs Learn Earlier, So Faster and Better
February 5, 2026
Authors: Ji Zhao, Yufei Gu, Shitong Shao, Xun Zhou, Liang Xiang, Zeke Xie
cs.AI
Abstract
As Large Language Models (LLMs) achieve remarkable empirical success through scaling model and data size, pretraining has become increasingly critical yet computationally prohibitive, hindering rapid development. Despite the availability of numerous pretrained LLMs developed at significant computational expense, a fundamental real-world question remains underexplored: Can we leverage existing small pretrained models to accelerate the training of larger models? In this paper, we propose a Late-to-Early Training (LET) paradigm that enables LLMs to explicitly learn later knowledge in earlier steps and earlier layers. The core idea is to guide the early layers of an LLM during early training using representations from the late layers of a pretrained (i.e., late-training-phase) model. We identify two key mechanisms that drive LET's effectiveness: late-to-early-step learning and late-to-early-layer learning. These mechanisms significantly accelerate training convergence while robustly enhancing both language modeling capabilities and downstream task performance, enabling faster training with superior performance. Extensive experiments on 1.4B- and 7B-parameter models demonstrate LET's efficiency and effectiveness. Notably, when training a 1.4B LLM on the Pile dataset, our method achieves up to a 1.6× speedup with nearly 5% improvement in downstream task accuracy compared to standard training, even when using a pretrained model with 10× fewer parameters than the target model.
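To make the core idea concrete, the sketch below shows one plausible way such late-to-early guidance could be wired up: a frozen, small pretrained teacher provides late-layer hidden states, a learned linear projection maps them into the target model's hidden size, and an auxiliary alignment term pulls an early layer of the target model toward them alongside the usual next-token loss. This is an illustrative sketch, not the paper's released implementation; the HuggingFace-style model interface, the MSE alignment loss, the layer indices, the weight `alpha`, and the shared-tokenizer assumption are all assumptions made for the example.

```python
# Illustrative sketch of late-to-early guidance (assumptions, not the paper's code):
# both models expose a HuggingFace-style causal-LM interface (output_hidden_states=True,
# .hidden_states, .loss) and share a tokenizer; layer indices, the MSE alignment loss,
# and the weight `alpha` are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LateToEarlyAligner(nn.Module):
    """Learned projection from the small teacher's hidden size to the target's."""

    def __init__(self, teacher_dim: int, student_dim: int):
        super().__init__()
        self.proj = nn.Linear(teacher_dim, student_dim)

    def forward(self, teacher_hidden: torch.Tensor) -> torch.Tensor:
        # teacher_hidden: [batch, seq, teacher_dim] -> [batch, seq, student_dim]
        return self.proj(teacher_hidden)


def let_loss(student, teacher, aligner, input_ids, labels,
             early_layer: int = 4, late_layer: int = -1, alpha: float = 0.1):
    """Standard next-token loss plus a late-to-early alignment term."""
    # Frozen small pretrained teacher: take a late-layer representation.
    with torch.no_grad():
        t_out = teacher(input_ids, output_hidden_states=True)
    teacher_hidden = t_out.hidden_states[late_layer]

    # Target model: usual language-modeling loss plus an early-layer hidden state.
    s_out = student(input_ids, labels=labels, output_hidden_states=True)
    student_hidden = s_out.hidden_states[early_layer]

    # Pull the target's early layer toward the projected late teacher features.
    align_loss = F.mse_loss(student_hidden, aligner(teacher_hidden))
    return s_out.loss + alpha * align_loss
```

In line with the late-to-early-step mechanism described in the abstract, such a guidance term would presumably be applied mainly during the early phase of training and then annealed or dropped, though the exact schedule is not specified here.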