Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling

Abstract

Tokenization is a fundamental component of large language models (LLMs), yet its influence on model scaling and performance has not been fully explored. In this paper, we introduce Over-Tokenized Transformers, a novel framework that decouples the input and output vocabularies to improve language modeling performance. Specifically, our approach scales up the input vocabulary to leverage multi-gram tokens. Through extensive experiments, we uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently improve model performance regardless of model size. Using a large input vocabulary, we achieve performance comparable to baselines twice the size at no additional cost. Our findings highlight the importance of tokenization in scaling laws and provide practical insights for tokenizer design, paving the way for more efficient and powerful LLMs.
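
The abstract does not say how multi-gram input tokens are constructed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes each position's input id is formed from the current token and the n-1 tokens before it, that missing history is padded with token id 0, and that the resulting id is folded into a fixed-size embedding table with a modulo. The function name `multigram_input_ids` and the parameters `base_vocab_size` and `table_size` are hypothetical. The output vocabulary is untouched: the model would still predict ordinary single-token ids.

```python
def multigram_input_ids(token_ids, n, base_vocab_size, table_size):
    """Return one n-gram input id per position (illustrative sketch).

    Assumptions (not from the paper): positions before the start of the
    sequence are padded with token id 0, and n-gram ids are reduced modulo
    table_size so they index a fixed-size input embedding table.
    """
    ids = []
    for i in range(len(token_ids)):
        gram_id = 0
        # Walk from the oldest token in the window to the current one,
        # treating the window as digits of a base-`base_vocab_size` number.
        for k in range(n - 1, -1, -1):
            j = i - k
            tok = token_ids[j] if j >= 0 else 0
            gram_id = gram_id * base_vocab_size + tok
        ids.append(gram_id % table_size)
    return ids


# Output targets stay in the original 1-gram vocabulary; only inputs change.
tokens = [3, 1, 4]
print(multigram_input_ids(tokens, n=2, base_vocab_size=10, table_size=1000))
# prints [3, 31, 14]
```

With n=1 the function returns the original token ids, so the usual shared-vocabulary setup is the special case of this sketch. Raising n (or `table_size`) enlarges only the input side, which is the decoupling the abstract describes.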
