Late-to-Early Training: LET LLMs Learn Earlier, So Faster and Better
February 5, 2026
Authors: Ji Zhao, Yufei Gu, Shitong Shao, Xun Zhou, Liang Xiang, Zeke Xie
cs.AI
Abstract
As Large Language Models (LLMs) achieve remarkable empirical success through scaling model and data size, pretraining has become increasingly critical yet computationally prohibitive, hindering rapid development. Despite the availability of numerous pretrained LLMs developed at significant computational expense, a fundamental real-world question remains underexplored: Can we leverage existing small pretrained models to accelerate the training of larger models? In this paper, we propose a Late-to-Early Training (LET) paradigm that enables LLMs to explicitly learn later knowledge in earlier steps and earlier layers. The core idea is to guide the early layers of an LLM during early training using representations from the late layers of a pretrained (i.e., late training phase) model. We identify two key mechanisms that drive LET's effectiveness: late-to-early-step learning and late-to-early-layer learning. These mechanisms significantly accelerate training convergence while robustly enhancing both language modeling capabilities and downstream task performance, enabling faster training with superior performance. Extensive experiments on 1.4B and 7B parameter models demonstrate LET's efficiency and effectiveness. Notably, when training a 1.4B LLM on the Pile dataset, our method achieves up to a 1.6× speedup with nearly 5% improvement in downstream task accuracy compared to standard training, even when using a pretrained model with 10× fewer parameters than the target model.
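To make the core idea concrete, below is a minimal sketch (in PyTorch) of how such late-to-early guidance could be wired into a training step: an auxiliary loss aligns an early-layer hidden state of the target model with a late-layer representation of a small, frozen pretrained model. The `LETGuidance` module, the linear projection, the MSE alignment loss, the layer choices, and the weight `alpha` are illustrative assumptions; the abstract does not specify the paper's exact loss or layer mapping.

```python
# Illustrative sketch of a LET-style auxiliary loss (not the paper's exact recipe).
# Assumptions: both models expose per-layer hidden states; the small pretrained
# teacher is frozen; layer indices, the projection, and `alpha` are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LETGuidance(nn.Module):
    def __init__(self, teacher_dim: int, student_dim: int, alpha: float = 0.1):
        super().__init__()
        # Project the teacher's late-layer states into the student's hidden size.
        self.proj = nn.Linear(teacher_dim, student_dim)
        self.alpha = alpha

    def forward(self, student_early_hidden: torch.Tensor,
                teacher_late_hidden: torch.Tensor) -> torch.Tensor:
        # Align the student's early-layer representation with the frozen teacher's
        # late-layer representation; MSE is one simple choice of alignment loss.
        target = self.proj(teacher_late_hidden.detach())
        return self.alpha * F.mse_loss(student_early_hidden, target)

# Usage inside a training step (tensor shapes: [batch, seq_len, hidden]).
guidance = LETGuidance(teacher_dim=512, student_dim=2048)
student_early = torch.randn(2, 16, 2048, requires_grad=True)  # e.g. an early layer of the target model
teacher_late = torch.randn(2, 16, 512)                        # e.g. the last layer of the small model
lm_loss = torch.tensor(2.3, requires_grad=True)               # stand-in for the usual LM loss
total_loss = lm_loss + guidance(student_early, teacher_late)
total_loss.backward()
```

In this reading, the guidance term would be added to the standard language modeling loss only during early training, which is consistent with the abstract's claim that later-stage, deeper-layer knowledge is injected into earlier steps and shallower layers.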