Efficient Pre-Training with Token Superposition

Abstract

Pre-training large language models is often prohibitively expensive and inefficient at scale, and achieving high data throughput typically requires complex, invasive modifications. In this work, we present Token-Superposition Training (TST), a simple drop-in method that significantly improves data throughput per FLOP during pre-training without modifying the parallelism, optimizer, tokenizer, data, or model architecture. TST proceeds in two phases: (i) a highly efficient superposition phase, in which we combine many contiguous tokens into one bag and train with a multi-hot cross-entropy (MCE) objective, and (ii) a recovery phase, in which we revert to standard training. We evaluate TST extensively at the 270M and 600M parameter scales and validate it at 3B and on a 10B A1B mixture-of-experts model, demonstrating that it is highly robust across settings. TST consistently outperforms the baseline in both loss and downstream evaluations, and under equal-loss settings it yields up to a 2.5x reduction in total pre-training time at the 10B A1B scale.
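
To make the superposition phase concrete, the following is a minimal sketch of one plausible reading of a multi-hot cross-entropy objective over a bag of tokens. The abstract does not specify how bags are formed or how the target is normalized, so everything here is an assumption for illustration: the helper names (`bag_tokens`, `multi_hot_target`, `multi_hot_cross_entropy`), the non-overlapping bags, and the uniform weighting of tokens within a bag are hypothetical and are not taken from the paper.

```python
import math


def bag_tokens(token_ids, bag_size):
    """Group contiguous token ids into non-overlapping bags.

    Assumption: a ragged tail shorter than bag_size is dropped.
    """
    usable = len(token_ids) - len(token_ids) % bag_size
    return [token_ids[i:i + bag_size] for i in range(0, usable, bag_size)]


def multi_hot_target(bag, vocab_size):
    """Build a normalized multi-hot target over the vocabulary.

    Assumption: every token in the bag receives equal probability mass,
    and a token that appears twice receives twice the mass.
    """
    target = [0.0] * vocab_size
    for tok in bag:
        target[tok] += 1.0 / len(bag)
    return target


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def multi_hot_cross_entropy(logits, bag):
    """Cross-entropy between the model distribution and the bag target."""
    target = multi_hot_target(bag, len(logits))
    log_probs = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(target, log_probs))
```

Under these assumptions, a bag of size one reduces to the standard next-token cross-entropy, which is consistent with a recovery phase that reverts to standard training.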
