Cached Transformers: Improving Transformers with Differentiable Memory Cache
December 20, 2023
作者: Zhaoyang Zhang, Wenqi Shao, Yixiao Ge, Xiaogang Wang, Jinwei Gu, Ping Luo
cs.AI
Abstract
This work introduces a new Transformer model called Cached Transformer, which
uses Gated Recurrent Cached (GRC) attention to extend the self-attention
mechanism with a differentiable memory cache of tokens. GRC attention enables
attending to both past and current tokens, increasing the receptive field of
attention and allowing for exploring long-range dependencies. By utilizing a
recurrent gating unit to continuously update the cache, our model achieves
significant advancements in six language and vision tasks, including
language modeling, machine translation, ListOPs, image classification, object
detection, and instance segmentation. Furthermore, our approach surpasses
previous memory-based techniques in tasks such as language modeling and
demonstrates applicability to a broader range of scenarios.
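As a rough illustration of the mechanism described in the abstract, below is a minimal PyTorch sketch of a GRC-style cached attention layer. This is not the authors' implementation: the class name GRCCachedAttention, the fixed cache length, the pooling of current tokens into cache slots, and the exact gate design are assumptions made for illustration. Only the overall pattern follows the abstract: a sigmoid-gated recurrent update of a differentiable token cache, with attention computed over the concatenation of cached (past) and current tokens.

```python
# Hypothetical sketch of GRC-style cached attention (not the authors' code).
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class GRCCachedAttention(nn.Module):
    """Self-attention over current tokens plus a gated recurrent token cache."""

    def __init__(self, dim: int, num_heads: int, cache_len: int):
        super().__init__()
        self.cache_len = cache_len
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Projects pooled current tokens into candidate cache content (assumed design).
        self.to_cache = nn.Linear(dim, dim)
        # Produces a per-slot sigmoid gate from [old cache, candidate].
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor, cache: Optional[torch.Tensor] = None):
        # x: (batch, seq_len, dim); cache: (batch, cache_len, dim) or None.
        b, _, d = x.shape
        if cache is None:
            cache = x.new_zeros(b, self.cache_len, d)

        # Candidate cache content: pool current tokens into cache_len slots
        # (a simplifying assumption; the paper may use a different mapping).
        pooled = F.adaptive_avg_pool1d(x.transpose(1, 2), self.cache_len).transpose(1, 2)
        candidate = self.to_cache(pooled)

        # Gated recurrent update: new_cache = (1 - g) * old_cache + g * candidate.
        g = torch.sigmoid(self.gate(torch.cat([cache, candidate], dim=-1)))
        new_cache = (1.0 - g) * cache + g * candidate

        # Attend to both cached (past) and current tokens.
        kv = torch.cat([new_cache, x], dim=1)
        out, _ = self.attn(x, kv, kv, need_weights=False)
        return out, new_cache
```

In use, a caller would thread the returned cache into the next forward pass (and decide whether to detach it across training iterations); the key point suggested by the abstract is that the cache update itself is differentiable, so gradients can shape what the model chooses to memorize.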