Cached Transformers: Improving Transformers with Differentiable Memory Cache
December 20, 2023
Authors: Zhaoyang Zhang, Wenqi Shao, Yixiao Ge, Xiaogang Wang, Jinwei Gu, Ping Luo
cs.AI
Abstract
This work introduces a new Transformer model called Cached Transformer, which
uses Gated Recurrent Cached (GRC) attention to extend the self-attention
mechanism with a differentiable memory cache of tokens. GRC attention enables
attending to both past and current tokens, increasing the receptive field of
attention and allowing it to explore long-range dependencies. By utilizing a
recurrent gating unit to continuously update the cache, our model achieves
significant advancements in six language and vision tasks, including
language modeling, machine translation, ListOPs, image classification, object
detection, and instance segmentation. Furthermore, our approach surpasses
previous memory-based techniques on tasks such as language modeling and can be
applied to a broader range of settings.
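GRC attention is only described at a high level here, so below is a minimal Python (PyTorch) sketch of what such a cached attention layer could look like, based solely on this abstract: the current tokens attend both to themselves and to a fixed-length memory cache, the two outputs are mixed by a learnable sigmoid ratio, and the cache is updated by a gated interpolation with a summary of the current tokens, keeping the whole update differentiable. The class name GRCAttention, the parameters cache_len, gate, and mix, and the exact update equations are assumptions for illustration and are not taken from the paper.

from typing import Optional

import torch
import torch.nn as nn


class GRCAttention(nn.Module):
    """Sketch of cached attention: self-attention over current tokens plus
    cross-attention to a differentiable memory cache updated by a gate."""

    def __init__(self, dim: int, num_heads: int = 8, cache_len: int = 64):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cache_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate deciding how much of the new token summary overwrites the cache.
        self.gate = nn.Linear(2 * dim, dim)
        # Learnable scalar mixing cache attention with ordinary self-attention.
        self.mix = nn.Parameter(torch.zeros(1))
        self.cache_len = cache_len
        self.dim = dim

    def forward(self, x: torch.Tensor, cache: Optional[torch.Tensor] = None):
        # x: (batch, seq_len, dim); cache: (batch, cache_len, dim) or None.
        if cache is None:
            cache = x.new_zeros(x.size(0), self.cache_len, self.dim)

        # Attend to current tokens and to cached (past) tokens.
        out_self, _ = self.self_attn(x, x, x)
        out_cache, _ = self.cache_attn(x, cache, cache)

        # Combine the two attention streams with a learnable sigmoid ratio.
        lam = torch.sigmoid(self.mix)
        out = lam * out_cache + (1.0 - lam) * out_self

        # Summarize the current tokens into cache_len slots (cache queries
        # attend to x), then update the cache with a recurrent gate.
        summary, _ = self.cache_attn(cache, x, x)
        g = torch.sigmoid(self.gate(torch.cat([cache, summary], dim=-1)))
        new_cache = (1.0 - g) * cache + g * summary
        return out, new_cache


# Usage: the cache returned by one call is fed into the next.
layer = GRCAttention(dim=256)
tokens = torch.randn(2, 128, 256)
out, cache = layer(tokens)           # first call starts from an empty cache
out, cache = layer(tokens, cache)    # later calls reuse and update the cache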