LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models

Abstract

Large language models (LLMs) face significant challenges in handling long-context tasks because their effective context window during pretraining is limited, which restricts their ability to generalize over extended sequences. Meanwhile, extending the context window of LLMs through post-pretraining is highly resource-intensive. To address this, we introduce **LongRecipe**, an efficient training strategy for extending the context window of LLMs that combines impactful token analysis, position index transformation, and training optimization strategies. It simulates long-sequence inputs while maintaining training efficiency, and it significantly improves the model's understanding of long-range dependencies. Experiments on three types of LLMs show that LongRecipe can utilize long sequences while requiring only 30% of the target context window size, and that it reduces computational training resources by more than 85% compared to full-sequence training. Furthermore, LongRecipe preserves the original LLM's capabilities on general tasks. Ultimately, *we can extend the effective context window of open-source LLMs from 8k to 128k, achieving performance close to GPT-4 with just one day of dedicated training on a single GPU with 80 GB of memory.* Our code is released at the [link](https://github.com/zhiyuanhubj/LongRecipe).
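
To make the 30% figure concrete: if the 128k target window is taken as 131,072 positions (an assumption), training sequences would be roughly 39k tokens long. The sketch below is a generic illustration of how a short sequence can be assigned position indices spread across a longer window, so that the model sees long-range position gaps without processing the full length. It is not the LongRecipe implementation; the function name `simulate_long_positions` and the uniform sampling scheme are assumptions made purely for illustration.

```python
import random
from typing import List, Optional


def simulate_long_positions(
    seq_len: int, target_len: int, seed: Optional[int] = None
) -> List[int]:
    """Return seq_len strictly increasing position indices drawn from [0, target_len).

    Hypothetical helper: uniform sampling without replacement is an
    illustrative choice, not the scheme described in the paper.
    """
    if not 0 < seq_len <= target_len:
        raise ValueError("seq_len must satisfy 0 < seq_len <= target_len")
    rng = random.Random(seed)
    # Sampling without replacement and sorting yields unique, ordered indices.
    return sorted(rng.sample(range(target_len), seq_len))


# Assumed sizes: 128k target window, training on 30% of it.
target_len = 128 * 1024
seq_len = target_len * 30 // 100  # integer arithmetic avoids float rounding

positions = simulate_long_positions(seq_len, target_len, seed=0)
```

Here `positions` would stand in for the usual `0, 1, ..., seq_len - 1` indices fed to the positional encoding, so a sequence of about 39k tokens covers index values anywhere in the 128k range.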
