Efficient Pre-Training with Token Superposition

Abstract

Pre-training large language models is often prohibitively expensive and inefficient at scale, and achieving high data throughput typically requires complex, invasive modifications. In this work, we present Token-Superposition Training (TST), a simple drop-in method that significantly improves data throughput per FLOP during pre-training without modifying the parallelism, optimizer, tokenizer, data, or model architecture. TST proceeds in two phases: (i) a highly efficient superposition phase, in which we combine many contiguous tokens into one bag and train with a multi-hot cross-entropy (MCE) objective, and (ii) a recovery phase, in which we revert to standard training. We extensively evaluate TST at the 270M and 600M parameter scales and validate it at 3B and on a 10B A1B mixture-of-experts model, demonstrating that it is highly robust across settings. Ultimately, TST consistently outperforms the baseline in both loss and downstream evaluations, and under equal-loss settings it yields up to a 2.5x reduction in total pre-training time at the 10B A1B scale.
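The superposition phase can be illustrated with a minimal sketch. The abstract does not give the exact form of the MCE objective or say how bags are fed to the model, so everything below is an assumption made for illustration: the helper names (`make_bags`, `multi_hot_target`, `multi_hot_cross_entropy`), the use of non-overlapping fixed-size bags, and a target that spreads probability mass evenly over the tokens in a bag.

```python
import math
from collections import Counter


def make_bags(tokens, bag_size):
    """Group contiguous token ids into non-overlapping bags of `bag_size`.

    Assumption of this sketch: a trailing partial bag is dropped.
    """
    n_bags = len(tokens) // bag_size
    return [tokens[i * bag_size:(i + 1) * bag_size] for i in range(n_bags)]


def multi_hot_target(bag, vocab_size):
    """Build a normalized multi-hot target over the vocabulary.

    Assumption: each token gets weight count / len(bag), so the target sums to 1.
    """
    target = [0.0] * vocab_size
    for token, count in Counter(bag).items():
        target[token] = count / len(bag)
    return target


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def multi_hot_cross_entropy(logits, bag):
    """Cross-entropy between the model's distribution and the bag's multi-hot target."""
    target = multi_hot_target(bag, len(logits))
    log_probs = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(target, log_probs))


# Eight tokens with bag size 4 become two positions for the model to process.
tokens = [3, 1, 4, 1, 5, 9, 2, 6]
bags = make_bags(tokens, bag_size=4)

# Hypothetical logits over a 10-token vocabulary, scored against the first bag.
loss = multi_hot_cross_entropy([0.0] * 10, bags[0])
```

Under these assumptions, a bag size of 1 reduces the loss to ordinary next-token cross-entropy, which corresponds to the standard training used in the recovery phase, while a bag size of s shortens the sequence the model processes by a factor of s.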

