
LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models

August 30, 2023
Authors: Chi Han, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, Sinong Wang
cs.AI

Abstract

In recent years, Transformer-based Large Language Models (LLMs) have achieved remarkable performance gains across various domains. As these LLMs are deployed for increasingly complex tasks, they often need to carry out longer reasoning processes or understand larger contexts. In these situations, the length generalization failure of LLMs on long sequences becomes more prominent. Most pre-training schemes truncate training sequences to a fixed length (such as 2048 for LLaMa). Even with relative positional encoding, which is designed to cope with this problem, LLMs often struggle to generate fluent text beyond that length, let alone carry out downstream tasks. Common solutions such as fine-tuning on longer corpora often involve daunting hardware and time costs and require careful training-process design. To leverage the generation capacity of existing LLMs more efficiently, we theoretically and empirically investigate the main out-of-distribution (OOD) factors contributing to this problem. Inspired by this diagnosis, we propose LM-Infinite, a simple yet effective solution for on-the-fly length generalization that involves only a Lambda-shaped attention mask and a distance limit, requiring no parameter updates or learning. We find it applicable to a variety of LLMs that use relative-position encoding methods. LM-Infinite is computationally efficient, with O(n) time and space, and demonstrates consistent fluency and generation quality on sequences as long as 32k tokens on the ArXiv and OpenWebText2 datasets, with a 2.72x decoding speedup. On downstream tasks such as passkey retrieval, it continues to work on inputs much longer than the training length, where vanilla models fail immediately.
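The abstract names two components: a Lambda-shaped attention mask and a distance limit. The sketch below is a minimal, illustrative rendering of those two ideas in PyTorch, not the authors' implementation; the parameters n_global, n_local, and d_max are hypothetical placeholders rather than values from the paper. The mask lets every query attend to the first few tokens of the sequence plus a sliding window of recent tokens, and the relative distances fed to the positional encoding are clamped so they never exceed what the model saw during pre-training.

```python
import torch


def lambda_shaped_mask(seq_len: int, n_global: int = 10, n_local: int = 2048) -> torch.Tensor:
    """Boolean attention mask illustrating a Lambda-shaped pattern.

    Each query attends to (a) the first `n_global` tokens of the sequence and
    (b) the most recent `n_local` tokens; everything in between is masked out.
    `n_global` and `n_local` are illustrative placeholders, not the paper's values.
    """
    q = torch.arange(seq_len).unsqueeze(1)   # query positions, shape (seq_len, 1)
    k = torch.arange(seq_len).unsqueeze(0)   # key positions, shape (1, seq_len)
    causal = k <= q                          # standard causal constraint
    global_branch = k < n_global             # leading "global" tokens
    local_branch = (q - k) < n_local         # sliding window of recent tokens
    return causal & (global_branch | local_branch)


def clipped_relative_distance(q_pos: torch.Tensor, k_pos: torch.Tensor, d_max: int = 2048) -> torch.Tensor:
    """Distance limit: cap the relative distance used by the positional encoding
    at `d_max`, so distances stay within the range seen during pre-training."""
    return torch.clamp(q_pos.unsqueeze(-1) - k_pos.unsqueeze(-2), max=d_max)
```

Under this sketch, each token attends to at most n_global + n_local keys, which is one way to see how a per-token cost stays constant and the overall decoding cost stays O(n) in time and space, as claimed in the abstract.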