
Gated Slot Attention for Efficient Linear-Time Sequence Modeling

September 11, 2024
Authors: Yu Zhang, Songlin Yang, Ruijie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, Peng Zhou, Guohong Fu
cs.AI

Abstract

Linear attention Transformers and their gated variants, celebrated for enabling parallel training and efficient recurrent inference, still fall short in recall-intensive tasks compared to traditional Transformers and demand significant resources for training from scratch. This paper introduces Gated Slot Attention (GSA), which enhances Attention with Bounded-memory-Control (ABC) by incorporating a gating mechanism inspired by Gated Linear Attention (GLA). Essentially, GSA comprises a two-layer GLA linked via softmax, utilizing context-aware memory reading and adaptive forgetting to improve memory capacity while maintaining compact recurrent state size. This design greatly enhances both training and inference efficiency through GLA's hardware-efficient training algorithm and reduced state size. Additionally, retaining the softmax operation is particularly beneficial in "finetuning pretrained Transformers to RNNs" (T2R) settings, reducing the need for extensive training from scratch. Extensive experiments confirm GSA's superior performance in scenarios requiring in-context recall and in T2R settings.
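To make the abstract's description more concrete, the following is a minimal recurrence sketch of a gated, bounded-memory (slot) attention step. It is an illustrative interpretation only: the per-slot forget gate alpha, the slot-update rule, and all names and shapes are assumptions inferred from the abstract's description (gated ABC-style slot memory with a softmax-linked, context-aware read), not the paper's exact formulation or the authors' released code.

# Illustrative sketch of a gated slot-memory recurrence (assumptions noted above).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_slot_step(K_mem, V_mem, q_t, k_t, v_t, alpha_t):
    """One recurrent step over m memory slots.

    K_mem: (m, d_k) slot keys, V_mem: (m, d_v) slot values
    q_t, k_t: (d_k,) query/key, v_t: (d_v,) value
    alpha_t: (m,) per-slot forget gate in (0, 1)  [assumed parameterization]
    """
    # Adaptive forgetting: each slot decays by its gate and writes the new
    # key/value weighted by (1 - gate) -- a gated variant of an ABC-style slot update.
    K_mem = alpha_t[:, None] * K_mem + (1.0 - alpha_t)[:, None] * k_t[None, :]
    V_mem = alpha_t[:, None] * V_mem + (1.0 - alpha_t)[:, None] * v_t[None, :]

    # Context-aware memory read: a softmax over slot scores links the two passes,
    # matching the abstract's "two-layer GLA linked via softmax" description.
    slot_scores = K_mem @ q_t            # (m,)
    read_weights = softmax(slot_scores)  # (m,)
    o_t = V_mem.T @ read_weights         # (d_v,)
    return K_mem, V_mem, o_t

# Toy usage: m = 4 slots, d_k = d_v = 8; the state stays (m, d) regardless of sequence length.
rng = np.random.default_rng(0)
m, d = 4, 8
K_mem, V_mem = np.zeros((m, d)), np.zeros((m, d))
for _ in range(16):
    q, k, v = rng.standard_normal(d), rng.standard_normal(d), rng.standard_normal(d)
    alpha = 1.0 / (1.0 + np.exp(-rng.standard_normal(m)))  # sigmoid gate (assumed)
    K_mem, V_mem, o = gated_slot_step(K_mem, V_mem, q, k, v, alpha)
print(o.shape)  # (8,)

The key property illustrated is that the recurrent state is bounded by the number of slots m rather than the sequence length, which is what enables linear-time inference with a compact state.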
