Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
February 29, 2024
Authors: Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, Arnaud Doucet, David Budden, Yee Whye Teh, Razvan Pascanu, Nando De Freitas, Caglar Gulcehre
cs.AI
Abstract
Recurrent neural networks (RNNs) have fast inference and scale efficiently on
long sequences, but they are difficult to train and hard to scale. We propose
Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that
mixes gated linear recurrences with local attention. Hawk exceeds the reported
performance of Mamba on downstream tasks, while Griffin matches the performance
of Llama-2 despite being trained on over 6 times fewer tokens. We also show
that Griffin can extrapolate on sequences significantly longer than those seen
during training. Our models match the hardware efficiency of Transformers
during training, and during inference they have lower latency and significantly
higher throughput. We scale Griffin up to 14B parameters, and explain how to
shard our models for efficient distributed training.
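
For readers unfamiliar with the term, the sketch below illustrates the general idea of a gated linear recurrence, the building block that Hawk uses throughout and that Griffin interleaves with local (sliding-window) attention. It is a simplified, hypothetical example rather than the paper's actual recurrence layer: the gate parameterisation and the function and variable names are assumptions for illustration only.

    import jax
    import jax.numpy as jnp

    # Hypothetical gated linear recurrence: h_t = a_t * h_{t-1} + (1 - a_t) * x_t,
    # with an input-dependent gate a_t in (0, 1). A simplified stand-in for the
    # recurrence layer described in the paper, not its exact parameterisation.
    def gated_linear_recurrence(x, gate_logits):
        # x, gate_logits: arrays of shape (seq_len, dim)
        a = jax.nn.sigmoid(gate_logits)  # per-step, per-channel gates in (0, 1)

        def step(h, inputs):
            a_t, x_t = inputs
            h = a_t * h + (1.0 - a_t) * x_t  # update is linear in h
            return h, h

        h0 = jnp.zeros(x.shape[-1])
        _, hidden_states = jax.lax.scan(step, h0, (a, x))
        return hidden_states  # one hidden state per time step

Because the update is linear in the hidden state, it can be evaluated with a parallel scan during training, while inference only needs a fixed-size recurrent state per token; this is the general mechanism behind the fast-inference and long-sequence advantages claimed in the abstract.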