
Efficient Pre-Training with Token Superposition

May 7, 2026
Authors: Bowen Peng, Théo Gigant, Jeffrey Quesnelle
cs.AI

Abstract

Pre-training of Large Language Models is often prohibitively expensive and inefficient at scale, requiring complex and invasive modifications to achieve high data throughput. In this work, we present Token-Superposition Training (TST), a simple drop-in method that significantly improves data throughput per FLOP during pre-training without modifying the parallelism, optimizer, tokenizer, data, or model architecture. TST proceeds in two phases: (i) a highly efficient superposition phase, where we combine many contiguous tokens into one bag and train with a multi-hot cross-entropy (MCE) objective, and (ii) a recovery phase, where we revert to standard training. We extensively evaluate TST at the 270M and 600M parameter scales and validate on 3B and 10B A1B mixture-of-experts models, demonstrating that it is highly robust across settings. Ultimately, TST consistently outperforms the baseline in loss and downstream evaluations, and under equal-loss settings it yields up to a 2.5x reduction in total pre-training time at the 10B A1B scale.
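To make the superposition phase concrete, the sketch below illustrates one way a multi-hot cross-entropy loss over bags of contiguous tokens could look in PyTorch. It is a minimal illustration, not the paper's implementation: the uniform target over each bag, the bagging of the inputs, and the function name multi_hot_cross_entropy are all assumptions made here for exposition, and the paper's exact target construction may differ.

```python
import torch
import torch.nn.functional as F

def multi_hot_cross_entropy(logits: torch.Tensor, token_bags: list[torch.Tensor]) -> torch.Tensor:
    """Illustrative multi-hot cross-entropy (MCE) over bags of target tokens.

    logits:     (num_bags, vocab_size) model outputs, one row per bag position
    token_bags: one LongTensor per bag, holding the contiguous token ids that
                were superposed into that bag
    """
    log_probs = F.log_softmax(logits, dim=-1)
    losses = []
    for i, bag in enumerate(token_bags):
        # Uniform multi-hot target over the bag's tokens (one possible choice;
        # the paper's exact target construction may differ).
        target = torch.zeros(logits.size(-1), device=logits.device)
        target[bag] = 1.0 / bag.numel()
        losses.append(-(target * log_probs[i]).sum())
    return torch.stack(losses).mean()


# Toy usage: two bags, each formed from three contiguous tokens of a small vocabulary.
vocab_size = 10
logits = torch.randn(2, vocab_size)
bags = [torch.tensor([1, 4, 7]), torch.tensor([0, 2, 3])]
print(multi_hot_cross_entropy(logits, bags))
```

Under this reading, each bag position asks the model to spread probability mass over all tokens in the bag rather than predict a single next token, which is what allows many tokens to be consumed per forward pass during the superposition phase before the recovery phase restores the standard next-token objective.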