Isolating the Benefits of Subword Tokenization in Language Model Training via Byte-Level Simulation

Abstract

Subword tokenization is an essential part of modern large language models (LLMs), yet its specific contributions to training efficiency and model performance remain poorly understood. In this work, we decouple the effects of subword tokenization by isolating them within a controlled byte-level pretraining pipeline. We formulate and test hypotheses across various dimensions, including sample throughput, vocabulary scaling, and the linguistic prior of subword boundaries. By simulating these effects in a byte-level setting, we refine our understanding of why subword models outperform raw byte models and offer insights to improve the pretraining of future byte-level and subword models. Specifically, our experiments highlight the critical role of increased training throughput and the integration of subword boundaries as either explicit priors or inductive biases.