Scaling LLM Pre-training with Vocabulary Curriculum

Abstract

Modern language models rely on static vocabularies that are fixed before pretraining, in contrast to the adaptive vocabulary acquisition observed in human language learning. To bridge this gap, we introduce vocabulary curriculum learning, an approach that improves pretraining efficiency with log-linear scaling gains relative to vocabulary size. Our method alternates between entropy-guided vocabulary expansion and model optimization, enabling models to learn representations that transfer across diverse tokenization granularities. This approach naturally gives rise to an optimal computation allocation pattern: longer tokens capture predictable content, while shorter tokens focus on more complex, harder-to-predict contexts. Experiments on small-scale GPT models demonstrate improved scaling efficiency, supporting the effectiveness of dynamic tokenization. We release our code to support further research and plan to extend our experiments to larger models and diverse domains.
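
To make the alternation described in the abstract concrete, the following is a minimal toy sketch, not the released implementation. It makes several assumptions that the abstract does not specify: a count-based bigram model stands in for the GPT model, "model optimization" is reduced to refitting those counts, and a token is merged with its successor whenever the next-token entropy after it falls below a hypothetical `threshold`. All function names are illustrative.

```python
import math
from collections import Counter, defaultdict


def fit_bigram(tokens):
    """Stand-in for model optimization: fit bigram counts on the current tokenization."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_entropy(counts, prev):
    """Shannon entropy (in bits) of the next-token distribution after `prev`."""
    successors = counts.get(prev, Counter())
    total = sum(successors.values())
    if total == 0:
        return float("inf")  # no evidence: treat as unpredictable
    return -sum((c / total) * math.log2(c / total) for c in successors.values())


def expand_vocab(tokens, counts, threshold):
    """Assumed merge rule: join a token with its successor when the successor is predictable."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and next_token_entropy(counts, tokens[i]) < threshold:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def vocabulary_curriculum(text, rounds=3, threshold=0.5):
    """Alternate between fitting the toy model and entropy-guided vocabulary expansion."""
    tokens = list(text)   # start from character-level tokens
    vocab = set(tokens)
    for _ in range(rounds):
        counts = fit_bigram(tokens)                       # model optimization step
        tokens = expand_vocab(tokens, counts, threshold)  # vocabulary expansion step
        vocab |= set(tokens)                              # vocabulary only grows
    return tokens, vocab


if __name__ == "__main__":
    tokens, vocab = vocabulary_curriculum("abababab", rounds=3)
    print(tokens, sorted(vocab, key=len))
```

In this toy setting, fully predictable text collapses into progressively longer tokens over the rounds, while positions followed by high-entropy continuations remain as short tokens, which mirrors the allocation pattern the abstract describes.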
