Decoupling the Benefits of Subword Tokenization in Language Model Training via Byte-Level Simulation

Abstract

Subword tokenization is an essential part of modern large language models (LLMs), yet its specific contributions to training efficiency and model performance remain poorly understood. In this work, we decouple the effects of subword tokenization by isolating them within a controlled byte-level pretraining pipeline. We formulate and test hypotheses along several dimensions, including sample throughput, vocabulary scaling, and the linguistic prior carried by subword boundaries. By simulating these effects in a byte-level setting, we refine our understanding of why subword models outperform raw byte models and offer insights for improving the pretraining of future byte-level and subword models. Specifically, our experiments highlight the critical role of increased training throughput and the importance of integrating subword boundaries as either explicit priors or inductive biases.