CLEX: Continuous Length Extrapolation for Large Language Models
October 25, 2023
Authors: Guanzheng Chen, Xin Li, Zaiqiao Meng, Shangsong Liang, Lidong Bing
cs.AI
Abstract
Transformer-based Large Language Models (LLMs) are pioneering advances in
many natural language processing tasks; however, their exceptional capabilities
are restricted to the preset context window of the Transformer. Position
Embedding (PE) scaling methods, while effective in extending the context window
to a specific length, either demonstrate notable limitations in their
extrapolation abilities or sacrifice part of the performance within the context
window. Length extrapolation methods, although theoretically capable of
extending the context window beyond the training sequence length, often
underperform in practical long-context applications. To address these
challenges, we propose Continuous Length EXtrapolation (CLEX) for LLMs. We
generalise the PE scaling approaches to model the continuous dynamics by
ordinary differential equations over the length scaling factor, thereby
overcoming the constraints of current PE scaling methods designed for specific
lengths. Moreover, by extending the dynamics to desired context lengths beyond
the training sequence length, CLEX facilitates the length extrapolation with
impressive performance in practical tasks. We demonstrate that CLEX can be
seamlessly incorporated into LLMs equipped with Rotary Position Embedding, such
as LLaMA and GPT-NeoX, with negligible impact on training and inference
latency. Experimental results reveal that CLEX can effectively extend the
context window to over 4x or almost 8x the training length, with no deterioration
in performance. Furthermore, when evaluated on the practical LongBench
benchmark, our model trained on a 4k sequence length exhibits competitive performance
against state-of-the-art open-source models trained on context lengths up to
32k.
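
To make the core idea more concrete, below is a minimal sketch of how a continuous length scaling factor can drive the RoPE frequency basis through an ordinary differential equation. The function names (`rope_frequencies`, `scale_frequencies_ode`, `rotary_embedding`), the Euler integrator, and the hand-written dynamics d(log θ)/dt = -1/t (which simply recovers position interpolation, θ/t) are illustrative assumptions rather than the paper's implementation; CLEX instead learns the dynamics with a neural ODE, which is not reproduced here.

```python
import torch

def rope_frequencies(dim: int, base: float = 10000.0) -> torch.Tensor:
    # Standard RoPE inverse frequencies: theta_i = base^(-2i/dim), i = 0 .. dim/2 - 1.
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

def scale_frequencies_ode(freqs: torch.Tensor, t_target: float, steps: int = 32) -> torch.Tensor:
    # Evolve the log-frequency basis continuously from scaling factor t = 1 to t_target
    # with a plain Euler integrator.  The hand-written dynamics d(log theta)/dt = -1/t
    # integrates to theta(t) = theta(1) / t, i.e. ordinary position interpolation;
    # CLEX replaces this fixed rule with a learned ODE over t (assumption: toy dynamics here).
    z = freqs.log()
    t = 1.0
    dt = (t_target - 1.0) / steps
    for _ in range(steps):
        z = z + (-1.0 / t) * dt
        t += dt
    return z.exp()

def rotary_embedding(x: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    # Apply rotary position embedding to x of shape (seq_len, dim).
    pos = torch.arange(x.shape[0]).float()
    angles = torch.outer(pos, freqs)          # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Example: a model trained with a 4k context applied to a 16k context (scaling factor t = 4).
dim, train_len, target_len = 128, 4096, 16384
freqs = scale_frequencies_ode(rope_frequencies(dim), t_target=target_len / train_len)
q = torch.randn(target_len, dim)
print(rotary_embedding(q, freqs).shape)  # torch.Size([16384, 128])
```

Because the scaling factor t is treated as a continuous variable, a learned dynamics function can be integrated up to factors beyond those seen during training, which is where the abstract's claim of extrapolating to 4x-8x the training length comes from.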