Effective Long-Context Scaling of Foundation Models
September 27, 2023
Authors: Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, Madian Khabsa, Han Fang, Yashar Mehdad, Sharan Narang, Kshitiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang, Hao Ma
cs.AI
Abstract
We present a series of long-context LLMs that support effective context
windows of up to 32,768 tokens. Our model series are built through continual
pretraining from Llama 2 with longer training sequences and on a dataset where
long texts are upsampled. We perform extensive evaluation on language modeling,
synthetic context probing tasks, and a wide range of research benchmarks. On
research benchmarks, our models achieve consistent improvements on most regular
tasks and significant improvements on long-context tasks over Llama 2. Notably,
with a cost-effective instruction tuning procedure that does not require
human-annotated long instruction data, the 70B variant can already surpass
gpt-3.5-turbo-16k's overall performance on a suite of long-context tasks.
Alongside these results, we provide an in-depth analysis on the individual
components of our method. We delve into Llama's position encoding and discuss
its limitation in modeling long dependencies. We also examine the impact of
various design choices in the pretraining process, including the data mix and
the training curriculum of sequence lengths -- our ablation experiments suggest
that having abundant long texts in the pretrain dataset is not the key to
achieving strong performance, and we empirically verify that long context
continual pretraining is more efficient and similarly effective compared to
pretraining from scratch with long sequences.
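For readers unfamiliar with the position-encoding scheme the abstract refers to: Llama 2 encodes token positions with rotary position embeddings (RoPE). The sketch below is a minimal NumPy illustration of how RoPE injects position information into query/key vectors; it is not the paper's implementation, and the function name `rope_rotate` and the default `base` of 10,000 are illustrative assumptions rather than the paper's long-context hyperparameters.

```python
# Minimal NumPy sketch of rotary position embeddings (RoPE), the scheme
# used by Llama 2. Names and the `base` value are illustrative assumptions.
import numpy as np

def rope_rotate(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embeddings to query or key vectors.

    x:         (seq_len, head_dim) vectors; head_dim must be even.
    positions: (seq_len,) integer token positions.
    """
    seq_len, head_dim = x.shape
    assert head_dim % 2 == 0, "RoPE requires an even head dimension"

    # Per-pair rotation frequencies: theta_i = base^(-2i / head_dim)
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)   # (head_dim/2,)
    angles = positions[:, None] * inv_freq[None, :]              # (seq_len, head_dim/2)
    cos, sin = np.cos(angles), np.sin(angles)

    # Rotate each (even, odd) coordinate pair by its position-dependent angle.
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Usage: rotate queries for a 4,096-token sequence with 128-dim heads.
q = np.random.randn(4096, 128)
q_rot = rope_rotate(q, np.arange(4096))
print(q_rot.shape)  # (4096, 128)
```

Because each coordinate pair is rotated by an angle proportional to the token position, positions far beyond those seen in training map to rotation patterns the model has rarely encountered, which gives some intuition for the long-dependency limitation the abstract discusses.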