Patch-Level Training for Large Language Models

July 17, 2024
Authors: Chenze Shao, Fandong Meng, Jie Zhou
cs.AI

Abstract

As Large Language Models (LLMs) achieve remarkable progress in language understanding and generation, their training efficiency has become a critical concern. Traditionally, LLMs are trained to predict the next token in a sequence. Despite the success of token-level training, it suffers from considerable computational costs due to the need to process an extensive number of tokens. To mitigate this issue, this paper introduces patch-level training for LLMs, which reduces the sequence length by compressing multiple tokens into a single patch. During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch, thereby processing the majority of the training data at a significantly reduced computational cost. Following this, the model continues token-level training on the remaining training data to align with the inference mode. Experiments on a diverse range of models (370M-2.7B parameters) demonstrate that patch-level training can reduce overall computational costs to 0.5×, without compromising the model performance compared to token-level training. Source code: https://github.com/shaochenze/PatchTrain.
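The abstract only sketches the idea at a high level. Below is a minimal illustrative sketch of how next-patch prediction could be implemented, assuming patches are formed by averaging K consecutive token embeddings and each patch position predicts all K tokens of the following patch; the embed / forward_embeds / lm_head interface is hypothetical and is not taken from the paper or its released code.

import torch
import torch.nn.functional as F

def patch_level_loss(model, token_ids, patch_size=4):
    """Next-patch prediction loss (illustrative sketch, not the authors' code).

    model: a decoder-only LM assumed to expose embed(ids), forward_embeds(x),
           and lm_head(h) (hypothetical interface for this example).
    token_ids: LongTensor of shape (batch, seq_len), seq_len divisible by patch_size.
    """
    B, T = token_ids.shape
    K = patch_size
    # Embed tokens and average every K consecutive embeddings into one patch embedding.
    tok_emb = model.embed(token_ids)                         # (B, T, d)
    patch_emb = tok_emb.view(B, T // K, K, -1).mean(dim=2)   # (B, T/K, d)
    # Run the transformer over the K-times shorter patch sequence.
    hidden = model.forward_embeds(patch_emb)                 # (B, T/K, d)
    # Each patch position predicts the K tokens of the *next* patch.
    logits = model.lm_head(hidden[:, :-1])                   # (B, T/K - 1, vocab)
    targets = token_ids.view(B, T // K, K)[:, 1:]            # (B, T/K - 1, K)
    # Reuse the same patch-level logits for all K target tokens of the next patch.
    vocab = logits.size(-1)
    loss = F.cross_entropy(
        logits.unsqueeze(2).expand(-1, -1, K, vocab).reshape(-1, vocab),
        targets.reshape(-1),
    )
    return loss

After this phase, training would switch back to standard next-token prediction on the remaining data so that the model matches the token-level inference mode described in the abstract.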
