Patch-Level Training for Large Language Models

Abstract

As Large Language Models (LLMs) achieve remarkable progress in language understanding and generation, their training efficiency has become a critical concern. Traditionally, LLMs are trained to predict the next token in a sequence. Token-level training has been successful, but it incurs considerable computational cost because it must process an extensive number of tokens. To mitigate this issue, this paper introduces patch-level training for LLMs, which reduces the sequence length by compressing multiple tokens into a single patch. During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch, thereby processing the majority of the training data at a significantly reduced computational cost. The model then continues with token-level training on the remaining training data to align with the inference mode. Experiments on a diverse range of models (370M to 2.7B parameters) demonstrate that patch-level training can reduce overall computational costs to 0.5× without compromising model performance compared to token-level training. Source code: https://github.com/shaochenze/PatchTrain.
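The two-stage schedule described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice to average token embeddings within a patch, the handling of leftover tokens, and the example values (patch size 4, two thirds of the data in the patch-level stage) are all assumptions made for illustration. The cost model further assumes that training cost is proportional to the number of sequence positions the model processes.

```python
from fractions import Fraction


def to_patches(token_ids, patch_size):
    """Group consecutive token IDs into patches of `patch_size` tokens.

    Assumption: trailing tokens that do not fill a whole patch are dropped.
    """
    usable = len(token_ids) - len(token_ids) % patch_size
    return [token_ids[i:i + patch_size] for i in range(0, usable, patch_size)]


def patch_embedding(token_embeddings):
    """Compress one patch into a single vector by averaging its token embeddings.

    Assumption: mean pooling is used here only as one simple way to
    "compress multiple tokens into a single patch".
    """
    count = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(vec[d] for vec in token_embeddings) / count for d in range(dim)]


def relative_cost(patch_fraction, patch_size):
    """Cost relative to pure token-level training.

    A fraction `patch_fraction` of the data is processed as patches
    (sequence length divided by `patch_size`); the rest is processed
    token by token at full cost.
    """
    patch_fraction = Fraction(patch_fraction)
    return patch_fraction / patch_size + (1 - patch_fraction)


if __name__ == "__main__":
    print(to_patches(list(range(10)), 4))
    print(patch_embedding([[1.0, 2.0], [3.0, 4.0]]))
    print(relative_cost(Fraction(2, 3), 4))
```

Under these assumed settings the relative cost is (2/3)/4 + 1/3 = 1/2, which is one combination consistent with the 0.5× figure quoted in the abstract; other combinations of patch size and data split could yield the same ratio.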
