Effective Long-Context Scaling of Foundation Models

September 27, 2023
作者: Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, Madian Khabsa, Han Fang, Yashar Mehdad, Sharan Narang, Kshitiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang, Hao Ma
cs.AI

Abstract

We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens. Our model series is built through continual pretraining from Llama 2 with longer training sequences, on a dataset where long texts are upsampled. We perform extensive evaluations on language modeling, synthetic context probing tasks, and a wide range of research benchmarks. On research benchmarks, our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2. Notably, with a cost-effective instruction tuning procedure that does not require human-annotated long instruction data, the 70B variant can already surpass gpt-3.5-turbo-16k's overall performance on a suite of long-context tasks. Alongside these results, we provide an in-depth analysis of the individual components of our method. We delve into Llama's position encoding and discuss its limitations in modeling long dependencies. We also examine the impact of various design choices in the pretraining process, including the data mix and the training curriculum of sequence lengths. Our ablation experiments suggest that having abundant long texts in the pretraining dataset is not the key to achieving strong performance, and we empirically verify that long-context continual pretraining is more efficient than pretraining from scratch with long sequences while being similarly effective.
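The abstract points to a limitation of Llama's rotary position encoding (RoPE) when modeling long dependencies. The sketch below is a minimal illustration of how RoPE rotation angles are built and how enlarging the base frequency slows the per-dimension rotation, which is one way to stretch the usable context window. It is not the paper's implementation: the function names (rope_angles, apply_rope), the head dimension, and the larger base value of 500,000 are assumptions made for illustration only.

```python
import torch

def rope_angles(head_dim: int, seq_len: int, base: float = 10000.0):
    """Build the RoPE rotation table: one angle per (position, dimension pair)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    positions = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.outer(positions, inv_freq)  # shape: (seq_len, head_dim // 2)
    return angles.cos(), angles.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Rotate query/key vectors of shape (batch, seq_len, head_dim) pairwise."""
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

# Default-style table (base 10000) versus a long-context variant with a larger
# base (assumed value): the larger base makes each dimension pair rotate more
# slowly with position, so relative rotations grow more gently over 32k tokens.
cos_4k, sin_4k = rope_angles(head_dim=128, seq_len=4096)
cos_32k, sin_32k = rope_angles(head_dim=128, seq_len=32768, base=500000.0)

q = torch.randn(1, 4096, 128)
q_rotated = apply_rope(q, cos_4k, sin_4k)
```

This is only meant to make the positional-encoding discussion concrete; the paper itself should be consulted for the exact modification and hyperparameters used in the released models.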