Proxy Compression for Language Modeling

Abstract

Modern language models are trained almost exclusively on token sequences produced by a fixed tokenizer, an external lossless compressor that typically operates over UTF-8 byte sequences, which couples the model to that compressor. This work introduces proxy compression, an alternative training scheme that preserves the efficiency benefits of compressed inputs while providing an end-to-end, raw-byte interface at inference time. During training, a single language model is trained jointly on raw byte sequences and on compressed views generated by external compressors; in the process, the model learns to internally align compressed sequences with raw bytes. This alignment enables strong transfer between the two formats, even when training predominantly on compressed inputs, which are discarded at inference. Extensive experiments on code language modeling demonstrate that proxy compression substantially improves training efficiency and significantly outperforms pure byte-level baselines under fixed compute budgets. As model scale increases, these gains become more pronounced, and proxy-trained models eventually match or rival tokenizer approaches, all while operating solely on raw bytes and retaining the inherent robustness of byte-level modeling.
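The abstract does not say how the two input formats are mixed, so the following is only a minimal sketch of what a joint training stream could look like, not the authors' implementation. Everything named here is an assumption made for illustration: `zlib` stands in for the external compressor, the marker ids `RAW_TAG` and `COMPRESSED_TAG` are hypothetical, and `p_compressed` is a hypothetical mixing ratio chosen high to reflect "training predominantly on compressed inputs".

```python
import random
import zlib

# Hypothetical marker ids placed outside the byte range 0..255 (an assumption,
# not taken from the paper) so the model can tell the two formats apart.
RAW_TAG = 256
COMPRESSED_TAG = 257


def compressed_view(data: bytes) -> list[int]:
    """Stand-in external lossless compressor: zlib output as a list of ids."""
    return list(zlib.compress(data))


def make_example(text: str, p_compressed: float, rng: random.Random) -> list[int]:
    """Emit one training sequence, either the compressed view or the raw bytes."""
    data = text.encode("utf-8")
    if rng.random() < p_compressed:
        return [COMPRESSED_TAG] + compressed_view(data)
    return [RAW_TAG] + list(data)


def build_mixture(docs: list[str], p_compressed: float = 0.9, seed: int = 0) -> list[list[int]]:
    """Build a training stream dominated by compressed views, with some raw bytes."""
    rng = random.Random(seed)
    return [make_example(doc, p_compressed, rng) for doc in docs]


def inference_prompt(text: str) -> list[int]:
    """At inference only raw bytes are used; the compressor is never called."""
    return [RAW_TAG] + list(text.encode("utf-8"))
```

In this sketch the compressor appears only in `build_mixture`, while `inference_prompt` touches nothing but UTF-8 bytes, which mirrors the claim that compressed inputs are discarded at inference.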
