
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

February 21, 2024
Authors: Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, Mao Yang
cs.AI

Abstract

A large context window is a desirable feature in large language models (LLMs). However, due to high fine-tuning costs, the scarcity of long texts, and the catastrophic values introduced by new token positions, current extended context windows are limited to around 128k tokens. This paper introduces LongRoPE, which, for the first time, extends the context window of pre-trained LLMs to an impressive 2048k tokens, with only up to 1k fine-tuning steps at training lengths within 256k, while maintaining performance at the original short context window. This is achieved through three key innovations: (i) we identify and exploit two forms of non-uniformity in positional interpolation through an efficient search, providing a better initialization for fine-tuning and enabling an 8x extension in non-fine-tuning scenarios; (ii) we introduce a progressive extension strategy that first fine-tunes a 256k-length LLM and then conducts a second positional interpolation on the fine-tuned extended LLM to achieve a 2048k context window; (iii) we readjust LongRoPE on 8k length to recover the short context window performance. Extensive experiments on LLaMA2 and Mistral across various tasks demonstrate the effectiveness of our method. Models extended via LongRoPE retain the original architecture with minor modifications to the positional embedding, and can reuse most pre-existing optimizations.
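
The core mechanism the abstract describes is a non-uniform rescaling of RoPE's per-dimension rotation angles. The following is a minimal PyTorch sketch of that idea only; the linear rescale factors, the keep_first protection of the earliest positions, and the function name rope_angles are illustrative assumptions, not the searched values or the actual code from the paper.

# Minimal sketch of non-uniform RoPE positional interpolation (illustrative only;
# the per-dimension rescale factors below are placeholders, not LongRoPE's searched values).
import torch

def rope_angles(seq_len, head_dim, base=10000.0, rescale=None, keep_first=0):
    """Compute RoPE rotation angles with optional per-dimension rescaling.

    rescale: tensor of shape (head_dim // 2,); position indices are divided by
             rescale[i] for frequency dimension i, stretching the effective
             context non-uniformly across dimensions.
    keep_first: number of initial token positions left un-interpolated
                (protecting the first tokens, in the spirit of LongRoPE).
    """
    half = head_dim // 2
    # Standard RoPE inverse frequencies: 1 / base^(i / (d/2)).
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    pos = torch.arange(seq_len, dtype=torch.float32)

    if rescale is None:
        rescale = torch.ones(half)

    # Interpolated positions: divide each position by its per-dimension factor.
    interp_pos = pos[:, None] / rescale[None, :]          # (seq_len, half)
    # Keep the first `keep_first` positions at their original, un-interpolated values.
    if keep_first > 0:
        interp_pos[:keep_first] = pos[:keep_first, None]

    angles = interp_pos * inv_freq[None, :]                # (seq_len, half)
    return torch.cos(angles), torch.sin(angles)

# Uniform positional interpolation would set every factor to target_len / orig_len;
# LongRoPE instead searches a separate factor per dimension. The linspace here is
# only a stand-in for such a non-uniform assignment.
cos, sin = rope_angles(seq_len=8192, head_dim=128,
                       rescale=torch.linspace(1.0, 8.0, 64), keep_first=4)

The design point this sketch is meant to convey is that the extension lives entirely in how the rotary angles are computed: the attention layers and weights are untouched, which is why models extended this way retain the original architecture and can reuse existing optimizations.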