Gated Slot Attention for Efficient Linear-Time Sequence Modeling
September 11, 2024
Authors: Yu Zhang, Songlin Yang, Ruijie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, Peng Zhou, Guohong Fu
cs.AI
Abstract
Linear attention Transformers and their gated variants, celebrated for
enabling parallel training and efficient recurrent inference, still fall short
in recall-intensive tasks compared to traditional Transformers and demand
significant resources for training from scratch. This paper introduces Gated
Slot Attention (GSA), which enhances Attention with Bounded-memory-Control
(ABC) by incorporating a gating mechanism inspired by Gated Linear Attention
(GLA). Essentially, GSA comprises a two-layer GLA linked via softmax, utilizing
context-aware memory reading and adaptive forgetting to improve memory capacity
while maintaining compact recurrent state size. This design greatly enhances
both training and inference efficiency through GLA's hardware-efficient
training algorithm and reduced state size. Additionally, retaining the softmax
operation is particularly beneficial in "finetuning pretrained Transformers to
RNNs" (T2R) settings, reducing the need for extensive training from scratch.
Extensive experiments confirm GSA's superior performance in scenarios requiring
in-context recall and in T2R settings.
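As a rough illustration of the mechanism the abstract describes (a bounded set of memory slots, per-slot adaptive forgetting, and a softmax-normalized, context-aware memory read), a single recurrent inference step might look like the sketch below. This is a minimal illustrative sketch only; the function name, tensor shapes, and the per-slot gate `alpha_t` are assumptions made for exposition, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gsa_recurrent_step(K_slots, V_slots, q_t, k_t, v_t, alpha_t):
    """One hypothetical recurrent step of a gated slot memory (illustrative only).

    K_slots, V_slots: (m, d) slot key/value memories (m slots, head dim d)
    q_t, k_t, v_t:    (d,)   per-token query/key/value projections
    alpha_t:          (m,)   per-slot forget gate in (0, 1)
    """
    # Gated write: each slot decays by its gate, and the new key/value is
    # written in proportion to how much the slot forgets (adaptive forgetting).
    K_slots = alpha_t[:, None] * K_slots + (1.0 - alpha_t)[:, None] * k_t[None, :]
    V_slots = alpha_t[:, None] * V_slots + (1.0 - alpha_t)[:, None] * v_t[None, :]
    # Context-aware read: score slots against the query, normalize with
    # softmax, and mix the slot values into the output.
    scores = K_slots @ q_t            # (m,)
    o_t = softmax(scores) @ V_slots   # (d,)
    return K_slots, V_slots, o_t

# Toy usage with random projections (shapes only; not a trained model).
m, d = 64, 128
K_slots, V_slots = np.zeros((m, d)), np.zeros((m, d))
q_t, k_t, v_t = (np.random.randn(d) for _ in range(3))
alpha_t = 1.0 / (1.0 + np.exp(-np.random.randn(m)))  # sigmoid-style gate
K_slots, V_slots, o_t = gsa_recurrent_step(K_slots, V_slots, q_t, k_t, v_t, alpha_t)
```

Under these assumptions, the recurrent state is just the two m-by-d slot matrices, independent of sequence length, which corresponds to the compact recurrent state size the abstract emphasizes.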