LongNet: Scaling Transformers to 1,000,000,000 Tokens
July 5, 2023
Authors: Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Furu Wei
cs.AI
Abstract
Scaling sequence length has become a critical demand in the era of large
language models. However, existing methods struggle with either computational
complexity or model expressivity, rendering the maximum sequence length
restricted. In this work, we introduce LongNet, a Transformer variant that can
scale sequence length to more than 1 billion tokens, without sacrificing the
performance on shorter sequences. Specifically, we propose dilated attention,
which expands the attentive field exponentially as the distance grows. LongNet
has significant advantages: 1) it has linear computational complexity and a
logarithmic dependency between tokens; 2) it can serve as a distributed
trainer for extremely long sequences; 3) its dilated attention is a drop-in
replacement for standard attention, which can be seamlessly integrated with the
existing Transformer-based optimizations. Experimental results demonstrate that
LongNet yields strong performance on both long-sequence modeling and general
language tasks. Our work opens up new possibilities for modeling very long
sequences, e.g., treating a whole corpus or even the entire Internet as a
sequence.
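
The abstract describes dilated attention only at a high level. Below is a minimal sketch of the core idea, not the authors' implementation: the sequence is split into segments, every dilation_rate-th token within each segment is kept, and standard attention is applied to the sparsified segments. The function name dilated_attention and the parameters segment_length and dilation_rate are illustrative assumptions; the sketch is single-head, unmasked, and uses only one (segment_length, dilation_rate) configuration, whereas the paper mixes several to obtain the exponentially expanding attentive field.

# Minimal sketch of dilated attention as described in the abstract.
# Assumptions (not from the source): single head, no causal mask,
# hypothetical parameter names segment_length and dilation_rate.
import torch
import torch.nn.functional as F

def dilated_attention(q, k, v, segment_length, dilation_rate):
    """Split the sequence into segments, keep every dilation_rate-th token
    inside each segment, and attend only within the sparsified segment."""
    batch, seq_len, dim = q.shape
    assert seq_len % segment_length == 0
    out = torch.zeros_like(q)
    for start in range(0, seq_len, segment_length):
        # indices of the sparsified (dilated) tokens in this segment
        idx = torch.arange(start, start + segment_length, dilation_rate)
        qs, ks, vs = q[:, idx], k[:, idx], v[:, idx]
        attn = F.softmax(qs @ ks.transpose(-1, -2) / dim ** 0.5, dim=-1)
        out[:, idx] = attn @ vs
    return out

# Usage with toy tensors; a single configuration is shown here.
q = k = v = torch.randn(1, 16, 8)
y = dilated_attention(q, k, v, segment_length=8, dilation_rate=2)

Because each segment attends over only segment_length / dilation_rate tokens, the cost per configuration is linear in sequence length; combining geometrically growing segment lengths and dilation rates, as the paper proposes, is what yields the exponentially expanding attentive field while keeping overall cost linear.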