
Data Engineering for Scaling Language Models to 128K Context

February 15, 2024
Authors: Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, Hao Peng
cs.AI

Abstract

We study the continual pretraining recipe for scaling language models' context lengths to 128K, with a focus on data engineering. We hypothesize that long context modeling, in particular the ability to utilize information at arbitrary input locations, is a capability that is mostly already acquired through large-scale pretraining, and that this capability can be readily extended to contexts substantially longer than seen during training (e.g., 4K to 128K) through lightweight continual pretraining on appropriate data mixture. We investigate the quantity and quality of the data for continual pretraining: (1) for quantity, we show that 500 million to 5 billion tokens are enough to enable the model to retrieve information anywhere within the 128K context; (2) for quality, our results equally emphasize domain balance and length upsampling. Concretely, we find that naively upsampling longer data on certain domains like books, a common practice of existing work, gives suboptimal performance, and that a balanced domain mixture is important. We demonstrate that continual pretraining of the full model on 1B-5B tokens of such data is an effective and affordable strategy for scaling the context length of language models to 128K. Our recipe outperforms strong open-source long-context models and closes the gap to frontier models like GPT-4 128K.
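The abstract's key data-quality point, keeping the domain mixture balanced while upsampling long documents within each domain rather than globally upweighting long-document domains such as books, can be illustrated with a small sketch. The `Doc` container, the `build_mixture` helper, and all ratios and thresholds below are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch (assumed, not the authors' code) of per-domain length
# upsampling with a balanced domain mixture for continual pretraining.
from dataclasses import dataclass
import random


@dataclass
class Doc:
    domain: str       # e.g. "web", "book", "code"
    num_tokens: int
    text: str


def build_mixture(docs: list[Doc],
                  domain_ratios: dict[str, float],   # kept at pretraining-time balance
                  long_threshold: int = 4096,        # assumed cutoff for "long" documents
                  long_boost: float = 5.0,           # assumed upsampling factor
                  target_tokens: int = 1_000_000_000) -> list[Doc]:
    """Sample a mixture that preserves the domain ratios while upweighting
    long documents *within* each domain (sampling with replacement)."""
    by_domain = {d: [doc for doc in docs if doc.domain == d and doc.num_tokens > 0]
                 for d in domain_ratios}
    mixture: list[Doc] = []
    for domain, ratio in domain_ratios.items():
        budget = int(target_tokens * ratio)   # token budget for this domain
        pool = by_domain[domain]
        if not pool:
            continue
        # Long documents get a higher sampling weight, but only inside this domain,
        # so the overall domain balance is unchanged.
        weights = [long_boost if doc.num_tokens >= long_threshold else 1.0
                   for doc in pool]
        sampled = 0
        while sampled < budget:
            doc = random.choices(pool, weights=weights, k=1)[0]
            mixture.append(doc)
            sampled += doc.num_tokens
    return mixture
```

With assumed ratios such as `{"web": 0.6, "book": 0.2, "code": 0.2}`, the resulting mixture keeps each domain at its original proportion while raising the share of long (here, ≥4K-token) documents inside every domain, which is the contrast the abstract draws with naively upsampling book-like domains.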
