Cached Transformers: Improving Transformers with Differentiable Memory Cache
December 20, 2023
作者: Zhaoyang Zhang, Wenqi Shao, Yixiao Ge, Xiaogang Wang, Jinwei Gu, Ping Luo
cs.AI
Abstract
This work introduces a new Transformer model called Cached Transformer, which
uses Gated Recurrent Cached (GRC) attention to extend the self-attention
mechanism with a differentiable memory cache of tokens. GRC attention enables
attending to both past and current tokens, increasing the receptive field of
attention and allowing for exploring long-range dependencies. By utilizing a
recurrent gating unit to continuously update the cache, our model achieves
significant advancements in six language and vision tasks, including
language modeling, machine translation, ListOPs, image classification, object
detection, and instance segmentation. Furthermore, our approach surpasses
previous memory-based techniques in tasks such as language modeling and
demonstrates applicability to a broader range of scenarios.
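As a rough illustration of the mechanism described in the abstract, below is a minimal PyTorch sketch of a GRC-style cached attention layer. This is not the authors' implementation: the class name GRCCachedAttention, the fixed cache length, the pooling of current tokens into cache slots, and the exact gate design are assumptions made for illustration. Only the overall pattern follows the abstract: a sigmoid-gated recurrent update of a differentiable token cache, with attention computed over the concatenation of cached (past) and current tokens.

```python
# Hypothetical sketch of GRC-style cached attention (not the authors' code).
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class GRCCachedAttention(nn.Module):
    """Self-attention over current tokens plus a gated recurrent token cache."""

    def __init__(self, dim: int, num_heads: int, cache_len: int):
        super().__init__()
        self.cache_len = cache_len
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Projects pooled current tokens into candidate cache content (assumed design).
        self.to_cache = nn.Linear(dim, dim)
        # Produces a per-slot sigmoid gate from [old cache, candidate].
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor, cache: Optional[torch.Tensor] = None):
        # x: (batch, seq_len, dim); cache: (batch, cache_len, dim) or None.
        b, _, d = x.shape
        if cache is None:
            cache = x.new_zeros(b, self.cache_len, d)

        # Candidate cache content: pool current tokens into cache_len slots
        # (a simplifying assumption; the paper may use a different mapping).
        pooled = F.adaptive_avg_pool1d(x.transpose(1, 2), self.cache_len).transpose(1, 2)
        candidate = self.to_cache(pooled)

        # Gated recurrent update: new_cache = (1 - g) * old_cache + g * candidate.
        g = torch.sigmoid(self.gate(torch.cat([cache, candidate], dim=-1)))
        new_cache = (1.0 - g) * cache + g * candidate

        # Attend to both cached (past) and current tokens.
        kv = torch.cat([new_cache, x], dim=1)
        out, _ = self.attn(x, kv, kv, need_weights=False)
        return out, new_cache
```

In use, a caller would thread the returned cache into the next forward pass (and decide whether to detach it across training iterations); the key point suggested by the abstract is that the cache update itself is differentiable, so gradients can shape what the model chooses to memorize.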