Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
February 29, 2024
Authors: Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, Arnaud Doucet, David Budden, Yee Whye Teh, Razvan Pascanu, Nando De Freitas, Caglar Gulcehre
cs.AI
Abstract
Recurrent neural networks (RNNs) have fast inference and scale efficiently on
long sequences, but they are difficult to train and hard to scale. We propose
Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that
mixes gated linear recurrences with local attention. Hawk exceeds the reported
performance of Mamba on downstream tasks, while Griffin matches the performance
of Llama-2 despite being trained on over 6 times fewer tokens. We also show
that Griffin can extrapolate on sequences significantly longer than those seen
during training. Our models match the hardware efficiency of Transformers
during training, and during inference they have lower latency and significantly
higher throughput. We scale Griffin up to 14B parameters, and explain how to
shard our models for efficient distributed training.
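To make the "gated linear recurrence" building block concrete, below is a minimal JAX sketch of a gated, diagonal linear recurrence of the kind the abstract refers to. The parameter names and the exact gating form here are illustrative assumptions for exposition, not the paper's precise layer definition.

```python
# Minimal sketch of a gated (diagonal) linear recurrence in JAX.
# The gating form and parameter names are assumptions for illustration,
# not the exact recurrence used in Hawk/Griffin.
import jax
import jax.numpy as jnp


def gated_linear_recurrence(x, w_a, w_i):
    """x: (seq_len, dim) inputs; w_a, w_i: (dim,) per-channel gate parameters."""
    a = jax.nn.sigmoid(w_a)            # recurrence gate in (0, 1), one value per channel
    gated_x = jax.nn.sigmoid(w_i) * x  # input gate applied elementwise to the inputs

    def step(h, x_t):
        h = a * h + (1.0 - a) * x_t    # diagonal linear recurrence over time
        return h, h

    h0 = jnp.zeros(x.shape[-1])
    _, ys = jax.lax.scan(step, h0, gated_x)
    return ys                          # (seq_len, dim) hidden states


# Example usage on random inputs.
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (16, 8))
out = gated_linear_recurrence(x, jnp.zeros(8), jnp.zeros(8))
print(out.shape)  # (16, 8)
```

Because the recurrence is linear and diagonal, the per-token state is a fixed-size vector, which is what gives such layers fast inference and efficient scaling to long sequences; in a hybrid model like Griffin, blocks of this kind are interleaved with local attention layers.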