Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
February 29, 2024
Authors: Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, Arnaud Doucet, David Budden, Yee Whye Teh, Razvan Pascanu, Nando De Freitas, Caglar Gulcehre
cs.AI
Abstract
Recurrent neural networks (RNNs) have fast inference and scale efficiently on
long sequences, but they are difficult to train and hard to scale. We propose
Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that
mixes gated linear recurrences with local attention. Hawk exceeds the reported
performance of Mamba on downstream tasks, while Griffin matches the performance
of Llama-2 despite being trained on over 6 times fewer tokens. We also show
that Griffin can extrapolate on sequences significantly longer than those seen
during training. Our models match the hardware efficiency of Transformers
during training, and during inference they have lower latency and significantly
higher throughput. We scale Griffin up to 14B parameters, and explain how to
shard our models for efficient distributed training.
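
For readers unfamiliar with the term, the sketch below illustrates the general idea of a gated linear recurrence, the building block that Hawk uses throughout and that Griffin interleaves with local (sliding-window) attention. It is a simplified, hypothetical example rather than the paper's actual recurrence layer: the gate parameterisation and the function and variable names are assumptions for illustration only.

    import jax
    import jax.numpy as jnp

    # Hypothetical gated linear recurrence: h_t = a_t * h_{t-1} + (1 - a_t) * x_t,
    # with an input-dependent gate a_t in (0, 1). A simplified stand-in for the
    # recurrence layer described in the paper, not its exact parameterisation.
    def gated_linear_recurrence(x, gate_logits):
        # x, gate_logits: arrays of shape (seq_len, dim)
        a = jax.nn.sigmoid(gate_logits)  # per-step, per-channel gates in (0, 1)

        def step(h, inputs):
            a_t, x_t = inputs
            h = a_t * h + (1.0 - a_t) * x_t  # update is linear in h
            return h, h

        h0 = jnp.zeros(x.shape[-1])
        _, hidden_states = jax.lax.scan(step, h0, (a, x))
        return hidden_states  # one hidden state per time step

Because the update is linear in the hidden state, it can be evaluated with a parallel scan during training, while inference only needs a fixed-size recurrent state per token; this is the general mechanism behind the fast-inference and long-sequence advantages claimed in the abstract.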