Patch-Level Training for Large Language Models

Abstract

As Large Language Models (LLMs) achieve remarkable progress in language understanding and generation, their training efficiency has become a critical concern. Traditionally, LLMs are trained to predict the next token in a sequence. Token-level training has been successful, but it incurs considerable computational cost because an extensive number of tokens must be processed. To mitigate this issue, this paper introduces patch-level training for LLMs, which reduces the sequence length by compressing multiple tokens into a single patch. During patch-level training, the language model is fed shorter sequences of patches and trained to predict the next patch, so that the majority of the training data is processed at a significantly reduced computational cost. The model then continues with token-level training on the remaining training data to align with the inference mode. Experiments on a diverse range of models (370M to 2.7B parameters) demonstrate that patch-level training can reduce overall computational costs to 0.5× without compromising model performance compared to token-level training. Source code: https://github.com/shaochenze/PatchTrain.
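The two-stage schedule described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. The helper names `to_patches` and `relative_cost` are hypothetical, and so are the patch size of 4 and the patch-level data fraction of 2/3, which are example values chosen because they reproduce the 0.5× figure. The sketch also assumes that training cost is proportional to the number of sequence positions the model processes.

```python
from fractions import Fraction


def to_patches(token_ids, patch_size):
    """Group consecutive token ids into patches of `patch_size` tokens.

    Trailing tokens that do not fill a whole patch are dropped
    (a simplifying assumption of this sketch).
    """
    usable = len(token_ids) - len(token_ids) % patch_size
    return [token_ids[i:i + patch_size] for i in range(0, usable, patch_size)]


def relative_cost(patch_size, patch_fraction):
    """Cost of the two-stage schedule relative to pure token-level training.

    Assumes cost scales with the number of sequence positions processed:
    the patch-level share of the data costs 1/patch_size as much, and the
    remaining share is trained at full token-level cost.
    """
    patch_fraction = Fraction(patch_fraction)
    return patch_fraction / patch_size + (1 - patch_fraction)


# Eight tokens become two patches of four tokens each.
print(to_patches(list(range(8)), 4))  # [[0, 1, 2, 3], [4, 5, 6, 7]]

# Hypothetical setting: patch size 4, two thirds of the data at patch level.
print(float(relative_cost(4, Fraction(2, 3))))  # 0.5
```

Under the same assumption, a larger patch size or a larger patch-level fraction lowers the relative cost further. Whether model quality is preserved in such settings is an empirical question that this sketch does not address.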
