CLEX: Continuous Length Extrapolation for Large Language Models
October 25, 2023
Authors: Guanzheng Chen, Xin Li, Zaiqiao Meng, Shangsong Liang, Lidong Bing
cs.AI
Abstract
Transformer-based Large Language Models (LLMs) are pioneering advances in
many natural language processing tasks; however, their exceptional capabilities
are restricted to the preset context window of the Transformer. Position
Embedding (PE) scaling methods, while effective in extending the context window
to a specific length, either demonstrate notable limitations in their
extrapolation abilities or sacrifice part of the performance within the context
window. Length extrapolation methods, although theoretically capable of
extending the context window beyond the training sequence length, often
underperform in practical long-context applications. To address these
challenges, we propose Continuous Length EXtrapolation (CLEX) for LLMs. We
generalise the PE scaling approaches to model the continuous dynamics by
ordinary differential equations over the length scaling factor, thereby
overcoming the constraints of current PE scaling methods designed for specific
lengths. Moreover, by extending the dynamics to desired context lengths beyond
the training sequence length, CLEX facilitates the length extrapolation with
impressive performance in practical tasks. We demonstrate that CLEX can be
seamlessly incorporated into LLMs equipped with Rotary Position Embedding, such
as LLaMA and GPT-NeoX, with negligible impact on training and inference
latency. Experimental results reveal that CLEX can effectively extend the
context window to over 4x or almost 8x the training length, with no deterioration
in performance. Furthermore, when evaluated on the practical LongBench
benchmark, our model trained on a 4k sequence length exhibits competitive performance
against state-of-the-art open-source models trained on context lengths up to
32k.
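
To make the core idea more concrete, below is a minimal sketch of how a continuous length scaling factor can drive the RoPE frequency basis through an ordinary differential equation. The function names (`rope_frequencies`, `scale_frequencies_ode`, `rotary_embedding`), the Euler integrator, and the hand-written dynamics d(log θ)/dt = -1/t (which simply recovers position interpolation, θ/t) are illustrative assumptions rather than the paper's implementation; CLEX instead learns the dynamics with a neural ODE, which is not reproduced here.

```python
import torch

def rope_frequencies(dim: int, base: float = 10000.0) -> torch.Tensor:
    # Standard RoPE inverse frequencies: theta_i = base^(-2i/dim), i = 0 .. dim/2 - 1.
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

def scale_frequencies_ode(freqs: torch.Tensor, t_target: float, steps: int = 32) -> torch.Tensor:
    # Evolve the log-frequency basis continuously from scaling factor t = 1 to t_target
    # with a plain Euler integrator.  The hand-written dynamics d(log theta)/dt = -1/t
    # integrates to theta(t) = theta(1) / t, i.e. ordinary position interpolation;
    # CLEX replaces this fixed rule with a learned ODE over t (assumption: toy dynamics here).
    z = freqs.log()
    t = 1.0
    dt = (t_target - 1.0) / steps
    for _ in range(steps):
        z = z + (-1.0 / t) * dt
        t += dt
    return z.exp()

def rotary_embedding(x: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    # Apply rotary position embedding to x of shape (seq_len, dim).
    pos = torch.arange(x.shape[0]).float()
    angles = torch.outer(pos, freqs)          # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Example: a model trained with a 4k context applied to a 16k context (scaling factor t = 4).
dim, train_len, target_len = 128, 4096, 16384
freqs = scale_frequencies_ode(rope_frequencies(dim), t_target=target_len / train_len)
q = torch.randn(target_len, dim)
print(rotary_embedding(q, freqs).shape)  # torch.Size([16384, 128])
```

Because the scaling factor t is treated as a continuous variable, a learned dynamics function can be integrated up to factors beyond those seen during training, which is where the abstract's claim of extrapolating to 4x-8x the training length comes from.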