CLEX: Continuous Length Extrapolation for Large Language Models
October 25, 2023
Authors: Guanzheng Chen, Xin Li, Zaiqiao Meng, Shangsong Liang, Lidong Bing
cs.AI
Abstract
Transformer-based Large Language Models (LLMs) are pioneering advances in
many natural language processing tasks; however, their exceptional capabilities
are confined to the preset context window of the Transformer. Position
Embedding (PE) scaling methods, while effective in extending the context window
to a specific length, either exhibit notable limitations in their
extrapolation abilities or sacrifice partial performance within the context
window. Length extrapolation methods, although theoretically capable of
extending the context window beyond the training sequence length, often
underperform in practical long-context applications. To address these
challenges, we propose Continuous Length EXtrapolation (CLEX) for LLMs. We
generalise the PE scaling approaches to model continuous dynamics via
ordinary differential equations over the length scaling factor, thereby
overcoming the constraints of current PE scaling methods that are designed for
specific lengths. Moreover, by extending the dynamics to desired context lengths
beyond the training sequence length, CLEX achieves impressive length-extrapolation
performance on practical tasks. We demonstrate that CLEX can be
seamlessly incorporated into LLMs equipped with Rotary Position Embedding, such
as LLaMA and GPT-NeoX, with negligible impact on training and inference
latency. Experimental results reveal that CLEX can effectively extend the
context window to over 4x, or almost 8x, the training length with no deterioration
in performance. Furthermore, when evaluated on the practical LongBench
benchmark, our model trained on a 4k sequence length exhibits competitive performance
against state-of-the-art open-source models trained on context lengths up to
32k.
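
To make the core idea concrete, the sketch below illustrates how a RoPE frequency basis could be scaled by integrating a simple ODE over the length scaling factor t, rather than by applying a fixed rule tied to one target length. This is a minimal, hypothetical sketch: the function names and the toy dynamics `g_phi` are illustrative assumptions only, not the learned parameterisation or training procedure described in the CLEX paper.

```python
# Hypothetical sketch: treat RoPE frequency scaling as a continuous trajectory
# over the length scaling factor t, obtained by integrating a small ODE,
# instead of choosing one fixed scaling rule for a single target length.
import torch


def rope_base_frequencies(head_dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE inverse frequencies theta_i = base^(-2i/d)."""
    two_i = torch.arange(0, head_dim, 2, dtype=torch.float32)
    return base ** (-two_i / head_dim)


def continuous_scaled_frequencies(
    head_dim: int,
    scale_t: float,
    num_steps: int = 32,
) -> torch.Tensor:
    """Integrate d(log theta)/dt = g_phi(log theta, t) from t = 1 to t = scale_t
    with simple Euler steps. Here g_phi is a toy closed-form choice whose exact
    solution recovers position-interpolation-style shrinking of the frequencies;
    in CLEX it would be a learned function of the scaling factor."""
    log_theta = torch.log(rope_base_frequencies(head_dim))

    def g_phi(log_theta: torch.Tensor, t: float) -> torch.Tensor:
        # Toy dynamics: uniform decay -1/t in every dimension. Integrating from
        # 1 to t gives theta(t) = theta(1) / t, i.e. a PI-style scaling rule.
        return torch.full_like(log_theta, -1.0 / t)

    t, dt = 1.0, (scale_t - 1.0) / num_steps
    for _ in range(num_steps):
        log_theta = log_theta + dt * g_phi(log_theta, t)
        t += dt
    return torch.exp(log_theta)


if __name__ == "__main__":
    theta_1 = rope_base_frequencies(128)
    theta_4 = continuous_scaled_frequencies(128, scale_t=4.0)
    # With the toy dynamics, theta_4 is approximately theta_1 / 4 (PI at 4x),
    # but any intermediate factor (e.g. t = 2.37) is equally well-defined,
    # which is the "continuous" property the abstract refers to.
    print(theta_1[:4], theta_4[:4])
```

Because the scaling factor is a continuous variable of the ODE rather than a fixed hyperparameter, the same trajectory can be evaluated at factors beyond those seen during training, which is the mechanism the abstract credits for extrapolating past the training sequence length.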