CLEX: Continuous Length Extrapolation for Large Language Models
October 25, 2023
Authors: Guanzheng Chen, Xin Li, Zaiqiao Meng, Shangsong Liang, Lidong Bing
cs.AI
Abstract
Transformer-based Large Language Models (LLMs) are pioneering advances in
many natural language processing tasks; however, their exceptional capabilities
are confined to the preset context window of the Transformer. Position
Embedding (PE) scaling methods, while effective in extending the context window
to a specific length, either exhibit notable limitations in their
extrapolation abilities or sacrifice partial performance within the context
window. Length extrapolation methods, although theoretically capable of
extending the context window beyond the training sequence length, often
underperform in practical long-context applications. To address these
challenges, we propose Continuous Length EXtrapolation (CLEX) for LLMs. We
generalise the PE scaling approaches to model continuous dynamics via
ordinary differential equations over the length scaling factor, thereby
overcoming the constraints of current PE scaling methods that are designed for
specific lengths. Moreover, by extending the dynamics to desired context lengths
beyond the training sequence length, CLEX achieves impressive length-extrapolation
performance on practical tasks. We demonstrate that CLEX can be
seamlessly incorporated into LLMs equipped with Rotary Position Embedding, such
as LLaMA and GPT-NeoX, with negligible impact on training and inference
latency. Experimental results reveal that CLEX can effectively extend the
context window to over 4x, or almost 8x, the training length with no deterioration
in performance. Furthermore, when evaluated on the practical LongBench
benchmark, our model trained on a 4k sequence length exhibits competitive performance
against state-of-the-art open-source models trained on context lengths up to
32k.
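
To make the core idea concrete, the sketch below illustrates how a RoPE frequency basis could be scaled by integrating a simple ODE over the length scaling factor t, rather than by applying a fixed rule tied to one target length. This is a minimal, hypothetical sketch: the function names and the toy dynamics `g_phi` are illustrative assumptions only, not the learned parameterisation or training procedure described in the CLEX paper.

```python
# Hypothetical sketch: treat RoPE frequency scaling as a continuous trajectory
# over the length scaling factor t, obtained by integrating a small ODE,
# instead of choosing one fixed scaling rule for a single target length.
import torch


def rope_base_frequencies(head_dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE inverse frequencies theta_i = base^(-2i/d)."""
    two_i = torch.arange(0, head_dim, 2, dtype=torch.float32)
    return base ** (-two_i / head_dim)


def continuous_scaled_frequencies(
    head_dim: int,
    scale_t: float,
    num_steps: int = 32,
) -> torch.Tensor:
    """Integrate d(log theta)/dt = g_phi(log theta, t) from t = 1 to t = scale_t
    with simple Euler steps. Here g_phi is a toy closed-form choice whose exact
    solution recovers position-interpolation-style shrinking of the frequencies;
    in CLEX it would be a learned function of the scaling factor."""
    log_theta = torch.log(rope_base_frequencies(head_dim))

    def g_phi(log_theta: torch.Tensor, t: float) -> torch.Tensor:
        # Toy dynamics: uniform decay -1/t in every dimension. Integrating from
        # 1 to t gives theta(t) = theta(1) / t, i.e. a PI-style scaling rule.
        return torch.full_like(log_theta, -1.0 / t)

    t, dt = 1.0, (scale_t - 1.0) / num_steps
    for _ in range(num_steps):
        log_theta = log_theta + dt * g_phi(log_theta, t)
        t += dt
    return torch.exp(log_theta)


if __name__ == "__main__":
    theta_1 = rope_base_frequencies(128)
    theta_4 = continuous_scaled_frequencies(128, scale_t=4.0)
    # With the toy dynamics, theta_4 is approximately theta_1 / 4 (PI at 4x),
    # but any intermediate factor (e.g. t = 2.37) is equally well-defined,
    # which is the "continuous" property the abstract refers to.
    print(theta_1[:4], theta_4[:4])
```

Because the scaling factor is a continuous variable of the ODE rather than a fixed hyperparameter, the same trajectory can be evaluated at factors beyond those seen during training, which is the mechanism the abstract credits for extrapolating past the training sequence length.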