Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
February 29, 2024
Authors: Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, Arnaud Doucet, David Budden, Yee Whye Teh, Razvan Pascanu, Nando De Freitas, Caglar Gulcehre
cs.AI
Abstract
Recurrent neural networks (RNNs) have fast inference and scale efficiently on
long sequences, but they are difficult to train and hard to scale. We propose
Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that
mixes gated linear recurrences with local attention. Hawk exceeds the reported
performance of Mamba on downstream tasks, while Griffin matches the performance
of Llama-2 despite being trained on over 6 times fewer tokens. We also show
that Griffin can extrapolate on sequences significantly longer than those seen
during training. Our models match the hardware efficiency of Transformers
during training, and during inference they have lower latency and significantly
higher throughput. We scale Griffin up to 14B parameters, and explain how to
shard our models for efficient distributed training.
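To make the "gated linear recurrence" building block concrete, below is a minimal JAX sketch of a gated, diagonal linear recurrence of the kind the abstract refers to. The parameter names and the exact gating form here are illustrative assumptions for exposition, not the paper's precise layer definition.

```python
# Minimal sketch of a gated (diagonal) linear recurrence in JAX.
# The gating form and parameter names are assumptions for illustration,
# not the exact recurrence used in Hawk/Griffin.
import jax
import jax.numpy as jnp


def gated_linear_recurrence(x, w_a, w_i):
    """x: (seq_len, dim) inputs; w_a, w_i: (dim,) per-channel gate parameters."""
    a = jax.nn.sigmoid(w_a)            # recurrence gate in (0, 1), one value per channel
    gated_x = jax.nn.sigmoid(w_i) * x  # input gate applied elementwise to the inputs

    def step(h, x_t):
        h = a * h + (1.0 - a) * x_t    # diagonal linear recurrence over time
        return h, h

    h0 = jnp.zeros(x.shape[-1])
    _, ys = jax.lax.scan(step, h0, gated_x)
    return ys                          # (seq_len, dim) hidden states


# Example usage on random inputs.
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (16, 8))
out = gated_linear_recurrence(x, jnp.zeros(8), jnp.zeros(8))
print(out.shape)  # (16, 8)
```

Because the recurrence is linear and diagonal, the per-token state is a fixed-size vector, which is what gives such layers fast inference and efficient scaling to long sequences; in a hybrid model like Griffin, blocks of this kind are interleaved with local attention layers.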