Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

Abstract

Recurrent neural networks (RNNs) have fast inference and scale efficiently on long sequences, but they are difficult to train and hard to scale. We propose Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that mixes gated linear recurrences with local attention. Hawk exceeds the reported performance of Mamba on downstream tasks, while Griffin matches the performance of Llama-2 despite being trained on over 6 times fewer tokens. We also show that Griffin can extrapolate on sequences significantly longer than those seen during training. Our models match the hardware efficiency of Transformers during training, and during inference they have lower latency and significantly higher throughput. We scale Griffin up to 14B parameters, and explain how to shard our models for efficient distributed training.
