

Extending LLMs' Context Window with 100 Samples

January 13, 2024
Authors: Yikai Zhang, Junlong Li, Pengfei Liu
cs.AI

Abstract

Large Language Models (LLMs) are known to have limited extrapolation ability beyond their pre-trained context window, constraining their application in downstream tasks with lengthy inputs. Recent studies have sought to extend LLMs' context window by modifying rotary position embedding (RoPE), a popular position encoding method adopted by well-known LLMs such as LLaMA, PaLM, and GPT-NeoX. However, prior works like Position Interpolation (PI) and YaRN are resource-intensive and lack comparative experiments to assess their applicability. In this work, we identify the inherent need for LLMs' attention entropy (i.e. the information entropy of attention scores) to maintain stability and introduce a novel extension to RoPE which combines adjusting RoPE's base frequency and scaling the attention logits to help LLMs efficiently adapt to a larger context window. We validate the superiority of our method in both fine-tuning performance and robustness across different context window sizes on various context-demanding tasks. Notably, our method extends the context window of LLaMA-2-7B-Chat to 16,384 with only 100 samples and 6 training steps, showcasing extraordinary efficiency. Finally, we also explore how data compositions and training curricula affect context window extension for specific downstream tasks, suggesting fine-tuning LLMs with lengthy conversations as a good starting point. We release our code and SFT data at https://github.com/GAIR-NLP/Entropy-ABF.
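To make the two ingredients of the method concrete, below is a minimal sketch of self-attention that (1) applies RoPE with an enlarged base frequency and (2) multiplies the attention logits by an extra scaling factor, which is the general recipe the abstract describes for keeping attention entropy stable over longer contexts. The specific values `rope_base=500_000.0` and `logit_scale=1.2`, the function names, and the tensor layout are illustrative assumptions, not the paper's reported settings; the actual base and scaling schedule come from the authors' entropy analysis and released code.

```python
import math
import torch


def rope_frequencies(head_dim: int, base: float) -> torch.Tensor:
    """Per-pair rotation frequencies for rotary position embedding (RoPE)."""
    return 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))


def apply_rope(x: torch.Tensor, positions: torch.Tensor, base: float) -> torch.Tensor:
    """Rotate query/key vectors by position-dependent angles.

    x: (seq_len, num_heads, head_dim), positions: (seq_len,)
    """
    freqs = rope_frequencies(x.shape[-1], base)            # (head_dim / 2,)
    angles = positions[:, None].float() * freqs[None, :]   # (seq_len, head_dim / 2)
    cos = angles.cos()[:, None, :]                          # broadcast over heads
    sin = angles.sin()[:, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


def attention_with_extension(q, k, v, positions,
                             rope_base: float = 500_000.0,   # assumed enlarged base (ABF-style)
                             logit_scale: float = 1.2):      # assumed attention-logit scaling factor
    """Scaled dot-product attention with an enlarged RoPE base and scaled logits,
    so attention entropy does not drift as the context window grows."""
    q = apply_rope(q, positions, rope_base)
    k = apply_rope(k, positions, rope_base)
    d = q.shape[-1]
    logits = torch.einsum("qhd,khd->hqk", q, k) * (logit_scale / math.sqrt(d))
    weights = logits.softmax(dim=-1)
    return torch.einsum("hqk,khd->qhd", weights, v)
```

In this sketch the two knobs are orthogonal: raising `rope_base` slows the rotation of high-frequency dimensions so distant positions remain distinguishable, while `logit_scale` sharpens the softmax to counteract the entropy increase that comes from attending over many more tokens.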