

LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models

August 30, 2023
Authors: Chi Han, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, Sinong Wang
cs.AI

Abstract

In recent years, there have been remarkable advancements in the performance of Transformer-based Large Language Models (LLMs) across various domains. As these LLMs are deployed for increasingly complex tasks, they often need to conduct longer reasoning processes or understand larger contexts. In these situations, the length generalization failure of LLMs on long sequences becomes more prominent. Most pre-training schemes truncate training sequences to a fixed length (such as 2048 for LLaMa). Beyond such lengths, LLMs often struggle to generate fluent text, let alone carry out downstream tasks, even with relative positional encoding, which is designed to cope with this problem. Common solutions such as fine-tuning on longer corpora often involve daunting hardware and time costs and require careful training process design. To more efficiently leverage the generation capacity of existing LLMs, we theoretically and empirically investigate the main out-of-distribution (OOD) factors contributing to this problem. Inspired by this diagnosis, we propose a simple yet effective solution for on-the-fly length generalization, LM-Infinite, which involves only a Lambda-shaped attention mask and a distance limit while requiring no parameter updates or learning. We find it applicable to a variety of LLMs using relative-position encoding methods. LM-Infinite is computationally efficient, with O(n) time and space, and demonstrates consistent fluency and generation quality on sequences as long as 32k tokens on the ArXiv and OpenWebText2 datasets, with a 2.72x decoding speedup. On downstream tasks such as passkey retrieval, it continues to work on inputs much longer than the training length, where vanilla models fail immediately.
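To make the mechanism described above concrete, below is a minimal sketch (not the authors' reference implementation) of the two ingredients the abstract names: a Lambda-shaped attention mask that keeps the first few tokens plus a sliding window of recent tokens, and a clipped relative distance so the positional encoding never sees offsets larger than those seen in pre-training. The parameter names n_global, n_local, and d_max are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch of a Lambda-shaped attention mask and distance limit,
# assuming a causal decoder-only Transformer. Not the authors' code.
import torch

def lambda_mask(seq_len: int, n_global: int = 4, n_local: int = 2048) -> torch.Tensor:
    """Boolean mask of shape (seq_len, seq_len); True means attention is allowed."""
    idx = torch.arange(seq_len)
    q, k = idx.unsqueeze(1), idx.unsqueeze(0)   # query rows, key columns
    causal = k <= q                             # never attend to future tokens
    global_branch = k < n_global                # always keep the earliest tokens
    local_branch = (q - k) < n_local            # sliding window of recent tokens
    return causal & (global_branch | local_branch)

def clipped_distances(seq_len: int, d_max: int = 2048) -> torch.Tensor:
    """Relative offsets (query index minus key index), clipped at d_max so the
    relative positional encoding stays within the range seen during training."""
    idx = torch.arange(seq_len)
    dist = idx.unsqueeze(1) - idx.unsqueeze(0)
    return dist.clamp(max=d_max)

# Each query row keeps at most n_global + n_local keys, so attention cost is O(n).
mask = lambda_mask(10, n_global=2, n_local=4)
print(mask.int())
print(clipped_distances(10, d_max=4))
```

Under these assumptions, each query attends to a bounded number of keys regardless of sequence length, which is consistent with the O(n) time and space claim in the abstract; how the clipped distances are wired into a specific relative-position scheme (e.g., RoPE or ALiBi) is model-dependent and not shown here.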