Dynamic Chunking for End-to-End Hierarchical Sequence Modeling

Abstract

Language models (LMs) have made incredible progress in recent years, largely by moving away from specialized models designed for specific tasks toward general models based on powerful architectures (e.g. the Transformer) that learn everything from raw data. Despite this progress, pre-processing steps such as tokenization remain a barrier to true end-to-end foundation models. We introduce a collection of new techniques that enable a dynamic chunking mechanism, which automatically learns content- and context-dependent segmentation strategies jointly with the rest of the model. Incorporating this mechanism into an explicit hierarchical network (H-Net) allows the (implicitly hierarchical) tokenization-LM-detokenization pipeline to be replaced with a single model learned fully end-to-end. When compute- and data-matched, an H-Net with one stage of hierarchy operating at the byte level outperforms a strong Transformer language model operating over BPE tokens. Iterating the hierarchy to multiple stages further increases its performance by modeling multiple levels of abstraction, demonstrating significantly better scaling with data and matching a token-based Transformer of twice its size. H-Nets pretrained on English show significantly increased character-level robustness, and qualitatively learn meaningful data-dependent chunking strategies without any heuristics or explicit supervision. Finally, the H-Net's improvement over tokenized pipelines is even larger in languages and modalities with weaker tokenization heuristics, such as Chinese and code, or DNA sequences (where data efficiency improves nearly 4x over baselines), showing the potential of true end-to-end models that learn and scale better from unprocessed data.
