LongNet: Scaling Transformers to 1,000,000,000 Tokens
July 5, 2023
Authors: Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Furu Wei
cs.AI
Abstract
Scaling sequence length has become a critical demand in the era of large
language models. However, existing methods struggle with either computational
complexity or model expressivity, rendering the maximum sequence length
restricted. In this work, we introduce LongNet, a Transformer variant that can
scale sequence length to more than 1 billion tokens, without sacrificing the
performance on shorter sequences. Specifically, we propose dilated attention,
which expands the attentive field exponentially as the distance grows. LongNet
has significant advantages: 1) it has linear computational complexity and a
logarithmic dependency between tokens; 2) it can serve as a distributed
trainer for extremely long sequences; 3) its dilated attention is a drop-in
replacement for standard attention, which can be seamlessly integrated with the
existing Transformer-based optimizations. Experimental results demonstrate that
LongNet yields strong performance on both long-sequence modeling and general
language tasks. Our work opens up new possibilities for modeling very long
sequences, e.g., treating a whole corpus or even the entire Internet as a
sequence.
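
The abstract describes dilated attention only at a high level. Below is a minimal sketch of the core idea, not the authors' implementation: the sequence is split into segments, every dilation_rate-th token within each segment is kept, and standard attention is applied to the sparsified segments. The function name dilated_attention and the parameters segment_length and dilation_rate are illustrative assumptions; the sketch is single-head, unmasked, and uses only one (segment_length, dilation_rate) configuration, whereas the paper mixes several to obtain the exponentially expanding attentive field.

# Minimal sketch of dilated attention as described in the abstract.
# Assumptions (not from the source): single head, no causal mask,
# hypothetical parameter names segment_length and dilation_rate.
import torch
import torch.nn.functional as F

def dilated_attention(q, k, v, segment_length, dilation_rate):
    """Split the sequence into segments, keep every dilation_rate-th token
    inside each segment, and attend only within the sparsified segment."""
    batch, seq_len, dim = q.shape
    assert seq_len % segment_length == 0
    out = torch.zeros_like(q)
    for start in range(0, seq_len, segment_length):
        # indices of the sparsified (dilated) tokens in this segment
        idx = torch.arange(start, start + segment_length, dilation_rate)
        qs, ks, vs = q[:, idx], k[:, idx], v[:, idx]
        attn = F.softmax(qs @ ks.transpose(-1, -2) / dim ** 0.5, dim=-1)
        out[:, idx] = attn @ vs
    return out

# Usage with toy tensors; a single configuration is shown here.
q = k = v = torch.randn(1, 16, 8)
y = dilated_attention(q, k, v, segment_length=8, dilation_rate=2)

Because each segment attends over only segment_length / dilation_rate tokens, the cost per configuration is linear in sequence length; combining geometrically growing segment lengths and dilation rates, as the paper proposes, is what yields the exponentially expanding attentive field while keeping overall cost linear.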