Extending Context Window of Large Language Models via Positional Interpolation
June 27, 2023
Authors: Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian
cs.AI
Abstract
We present Position Interpolation (PI) that extends the context window sizes
of RoPE-based pretrained LLMs such as LLaMA models to up to 32768 with minimal
fine-tuning (within 1000 steps), while demonstrating strong empirical results
on various tasks that require long context, including passkey retrieval,
language modeling, and long document summarization from LLaMA 7B to 65B.
Meanwhile, models extended by Position Interpolation preserve quality
relatively well on tasks within their original context window. To achieve this
goal, Position Interpolation linearly down-scales the input position indices to
match the original context window size, rather than extrapolating beyond the
trained context length, which may lead to catastrophically high attention scores
that completely ruin the self-attention mechanism. Our theoretical study shows
that the upper bound of interpolation is at least ~600 times smaller
than that of extrapolation, further demonstrating its stability. Models
extended via Position Interpolation retain their original architecture and can
reuse most pre-existing optimization and infrastructure.
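
The core mechanism can be illustrated with a minimal sketch, assuming a standard RoPE formulation: instead of feeding position indices beyond the trained window (extrapolation), the indices are linearly down-scaled so the extended sequence still maps into the original range. The function names, tensor shapes, and the 2048-to-8192 window sizes below are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal sketch of Position Interpolation (PI) for RoPE.
# Assumes a standard pairwise-rotation RoPE; not the paper's reference code.
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Rotation angles for (possibly fractional) positions, shape (seq_len, dim/2)."""
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    return positions[:, None].float() * inv_freq[None, :]

def apply_rope(x: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
    """Rotate query/key vectors x of shape (seq_len, dim) by their position angles."""
    angles = rope_angles(positions, x.shape[-1])
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Position Interpolation: linearly down-scale position indices so the
# extended sequence stays within the pretrained position range.
original_window = 2048                    # context length used in pretraining (assumed)
extended_window = 8192                    # target context length after extension (assumed)
scale = original_window / extended_window

seq_len, head_dim = extended_window, 64
q = torch.randn(seq_len, head_dim)

positions = torch.arange(seq_len)         # 0 .. 8191: beyond the trained range (extrapolation)
interpolated = positions * scale          # 0 .. 2047.75: stays in-range (interpolation)
q_rotated = apply_rope(q, interpolated)   # RoPE accepts fractional positions directly
```

Because only the position indices change, the model architecture and attention computation are untouched, which is why existing optimizations and infrastructure can be reused after extension.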