Scaling LLM Pre-training with Vocabulary Curriculum

Abstract

Modern language models rely on static vocabularies, fixed before pretraining, in contrast to the adaptive vocabulary acquisition observed in human language learning. To bridge this gap, we introduce vocabulary curriculum learning, an approach that improves pretraining efficiency with log-linear scaling gains relative to vocabulary size. Our method alternates between entropy-guided vocabulary expansion and model optimization, enabling models to learn transferable representations across diverse tokenization granularities. This approach naturally gives rise to an optimal computation allocation pattern: longer tokens capture predictable content, while shorter tokens focus on more complex, harder-to-predict contexts. Experiments on small-scale GPT models demonstrate improved scaling efficiency, reinforcing the effectiveness of dynamic tokenization. We release our code to support further research and plan to extend our experiments to larger models and diverse domains.
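The alternation between entropy-guided vocabulary expansion and model optimization can be illustrated with a toy sketch. This is not the released implementation: it assumes a count-based bigram predictor standing in for the GPT model, and it assumes that "expansion" means merging a token with its successor whenever the predictor's next-token entropy falls at or below a threshold. The function names, the `threshold` and `rounds` parameters, and the greedy left-to-right merge rule are all hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict


def next_token_entropy(tokens):
    """Fit a bigram predictor (stand-in for model optimization) and return,
    for each token, the entropy in bits of the distribution over what follows it."""
    follow = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        follow[prev][nxt] += 1
    entropy = {}
    for prev, counts in follow.items():
        total = sum(counts.values())
        entropy[prev] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy


def expand_vocabulary(tokens, threshold):
    """One expansion round (assumed rule): greedily merge a token with its
    successor when the successor is predictable (entropy <= threshold)."""
    entropy = next_token_entropy(tokens)
    merged, new_entries = [], set()
    i = 0
    while i < len(tokens):
        cur = tokens[i]
        has_next = i + 1 < len(tokens)
        if has_next and entropy.get(cur, float("inf")) <= threshold:
            new_token = cur + tokens[i + 1]
            merged.append(new_token)
            new_entries.add(new_token)
            i += 2  # both tokens were consumed by the merge
        else:
            merged.append(cur)
            i += 1
    return merged, new_entries


def vocabulary_curriculum(text, rounds, threshold):
    """Alternate refitting the predictor and expanding the vocabulary,
    starting from single characters, until no new entries appear."""
    tokens = list(text)
    vocab = set(tokens)
    for _ in range(rounds):
        tokens, new_entries = expand_vocabulary(tokens, threshold)
        if not new_entries:
            break
        vocab |= new_entries
    return tokens, vocab


if __name__ == "__main__":
    tokens, vocab = vocabulary_curriculum("thethethecat", rounds=5, threshold=0.5)
    print(tokens)
    print(sorted(vocab, key=len))
```

On the string `"thethethecat"` this toy converges to the tokens `the, the, the, cat`. Predictable spans are absorbed into longer tokens, while the boundary after `the`, where the continuation is uncertain, stays a token boundary. That mirrors, in miniature, the allocation pattern the abstract describes.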
