Data Engineering for Scaling Language Models to 128K Context
February 15, 2024
Authors: Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, Hao Peng
cs.AI
Abstract
We study the continual pretraining recipe for scaling language models'
context lengths to 128K, with a focus on data engineering. We hypothesize that
long context modeling, in particular the ability to utilize information
at arbitrary input locations, is a capability that is mostly already acquired
through large-scale pretraining, and that this capability can be readily
extended to contexts substantially longer than seen during training (e.g., 4K
to 128K) through lightweight continual pretraining on an appropriate data mixture.
We investigate the quantity and quality of the data for
continual pretraining: (1) for quantity, we show that 500 million to 5 billion
tokens are enough to enable the model to retrieve information anywhere within
the 128K context; (2) for quality, our results equally emphasize domain
balance and length upsampling. Concretely, we find that naively
upsampling longer data on certain domains like books, a common practice of
existing work, gives suboptimal performance, and that a balanced domain mixture
is important. We demonstrate that continual pretraining of the full model on
1B-5B tokens of such data is an effective and affordable strategy for scaling
the context length of language models to 128K. Our recipe outperforms strong
open-source long-context models and closes the gap to frontier models like
GPT-4 128K.
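The abstract's central data-engineering point is that long documents should be upsampled *within* each source domain while the overall domain mixture stays balanced, rather than globally upsampling long-document-heavy domains like books. The sketch below is a minimal toy illustration of that contrast, not the paper's actual pipeline; the document lists, length threshold, and upsampling factor are all hypothetical.

```python
from collections import defaultdict

def domain_shares(docs):
    """Fraction of total tokens contributed by each domain.

    `docs` is a list of (domain, length_in_tokens) pairs standing in
    for a corpus of documents.
    """
    totals = defaultdict(int)
    for domain, length in docs:
        totals[domain] += length
    grand = sum(totals.values())
    return {d: t / grand for d, t in totals.items()}

def naive_upsample(docs, threshold, factor):
    """Globally replicate long documents `factor` times in total.

    This is the "naive" strategy: domain shares drift toward whichever
    domain contains the most long documents (e.g. books).
    """
    out = list(docs)
    for doc in docs:
        if doc[1] >= threshold:
            out += [doc] * (factor - 1)
    return out

def per_source_upsample(docs, threshold, factor):
    """Replicate long documents within each domain, then rescale each
    domain's sampling weight so its token share matches the original
    (balanced) mixture.

    Returns (domain, length, sampling_weight) triples whose per-domain
    weight sums equal the original domain shares.
    """
    target = domain_shares(docs)
    by_domain = defaultdict(list)
    for doc in docs:
        by_domain[doc[0]].append(doc)
    weighted = []
    for d, domain_docs in by_domain.items():
        up = naive_upsample(domain_docs, threshold, factor)
        up_tokens = sum(length for _, length in up)
        per_token_w = target[d] / up_tokens  # renormalize the domain
        weighted += [(dom, length, per_token_w * length) for dom, length in up]
    return weighted
```

Comparing `domain_shares(naive_upsample(...))` against the original shares makes the skew visible, while summing the weights returned by `per_source_upsample` per domain recovers the original balanced mixture exactly: long documents are favored within each source, but no source gains token share over another.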