Efficient Pre-Training with Token Superposition

May 7, 2026
Authors: Bowen Peng, Théo Gigant, Jeffrey Quesnelle
cs.AI

Abstract

Pre-training of Large Language Models is often prohibitively expensive and inefficient at scale, requiring complex and invasive modifications in order to achieve high data throughput. In this work, we present Token-Superposition Training (TST), a simple drop-in method that significantly improves data throughput per FLOP during pre-training without modifying the parallelism, optimizer, tokenizer, data, or model architecture. TST proceeds in two phases: (i) a highly efficient superposition phase, in which many contiguous tokens are combined into one bag and trained with a multi-hot cross-entropy (MCE) objective, and (ii) a recovery phase, in which we revert to standard training. We extensively evaluate TST at the 270M and 600M parameter scales and validate it on a 3B model and a 10B A1B mixture-of-experts model, demonstrating that it is highly robust across settings. Ultimately, TST consistently outperforms the baseline in both loss and downstream evaluations, and under equal-loss settings it yields up to a 2.5x reduction in total pre-training time at the 10B A1B scale.
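The abstract does not spell out the MCE objective in detail. The sketch below is a minimal PyTorch illustration of one plausible reading: consecutive target tokens are grouped into bags of size k, each bag is turned into a multi-hot indicator over the vocabulary, and the loss is cross-entropy against the normalized multi-hot target. The bag construction, the functions `bag_targets` and `mce_loss`, and all parameter names are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a multi-hot cross-entropy (MCE) loss over bags of
# contiguous target tokens. Bag size `k` and all names are assumptions;
# the paper's actual superposition/recovery schedule is not shown here.
import torch
import torch.nn.functional as F


def bag_targets(token_ids: torch.Tensor, k: int, vocab_size: int) -> torch.Tensor:
    """Group k contiguous target tokens into multi-hot vectors.

    token_ids: (batch, seq_len) integer tensor, seq_len divisible by k.
    Returns:   (batch, seq_len // k, vocab_size) multi-hot targets.
    """
    b, t = token_ids.shape
    bags = token_ids.view(b, t // k, k)                 # (b, n_bags, k)
    multi_hot = torch.zeros(b, t // k, vocab_size, device=token_ids.device)
    multi_hot.scatter_(-1, bags, 1.0)                   # mark every token present in the bag
    return multi_hot


def mce_loss(logits: torch.Tensor, multi_hot: torch.Tensor) -> torch.Tensor:
    """Cross-entropy against the normalized multi-hot target distribution.

    logits:    (batch, n_bags, vocab_size) model outputs, one per bag position.
    multi_hot: (batch, n_bags, vocab_size) 0/1 indicators from bag_targets.
    """
    target_dist = multi_hot / multi_hot.sum(dim=-1, keepdim=True)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(target_dist * log_probs).sum(dim=-1).mean()
```

In this reading, the superposition phase would score one prediction per bag instead of one per token, which is where the per-FLOP throughput gain would come from, and the recovery phase would simply switch back to the usual next-token cross-entropy; how the paper forms the model inputs for each bag is not specified in the abstract.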