Data Engineering for Scaling Language Models to 128K Context

Abstract

We study the continual pretraining recipe for scaling language models' context lengths to 128K, with a focus on data engineering. We hypothesize that long-context modeling, in particular the ability to utilize information at arbitrary input locations, is a capability that is mostly already acquired through large-scale pretraining, and that this capability can be readily extended to contexts substantially longer than those seen during training (e.g., 4K to 128K) through lightweight continual pretraining on an appropriate data mixture. We investigate the quantity and quality of the data for continual pretraining: (1) for quantity, we show that 500 million to 5 billion tokens are enough to enable the model to retrieve information anywhere within the 128K context; (2) for quality, our results equally emphasize domain balance and length upsampling. Concretely, we find that naively upsampling longer data on certain domains like books, a common practice in existing work, gives suboptimal performance, and that a balanced domain mixture is important. We demonstrate that continual pretraining of the full model on 1B-5B tokens of such data is an effective and affordable strategy for scaling the context length of language models to 128K. Our recipe outperforms strong open-source long-context models and closes the gap to frontier models like GPT-4 128K.
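
The abstract does not spell out the sampling procedure, so the following is only a minimal sketch of one way to combine length upsampling with a balanced domain mixture: each domain keeps its original share of the token budget, and long documents are upweighted only within their own domain. The function name, the `(domain, n_tokens)` document format, and the `long_threshold` and `long_weight` values are all hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def per_domain_length_upsample(docs, token_budget, long_threshold=32_000,
                               long_weight=5.0, seed=0):
    """Draw a training mixture that upsamples long documents inside each
    domain while keeping every domain's share of the token budget fixed.

    docs: list of (domain, n_tokens) pairs (hypothetical format).
    Returns a list of sampled (domain, n_tokens) pairs, with repetition.
    """
    rng = random.Random(seed)

    # Group document lengths by domain.
    by_domain = defaultdict(list)
    for domain, n_tokens in docs:
        by_domain[domain].append(n_tokens)

    total_tokens = sum(n for _, n in docs)
    mixture = []
    for domain, lengths in sorted(by_domain.items()):
        # The domain's budget is proportional to its original token share,
        # so upsampling long documents cannot shift the domain balance.
        domain_budget = token_budget * sum(lengths) / total_tokens

        # Assumed weighting: long documents are `long_weight` times more
        # likely to be drawn than short ones from the same domain.
        weights = [long_weight if n >= long_threshold else 1.0
                   for n in lengths]

        used = 0
        while used < domain_budget:
            n = rng.choices(lengths, weights=weights, k=1)[0]
            mixture.append((domain, n))
            used += n
    return mixture
```

By contrast, the naive approach the abstract warns against would apply a single length-based weight across the whole corpus, which shifts the mixture toward whichever domains happen to contain the most long documents, such as books.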
