Cached Transformers: Improving Transformers with Differentiable Memory Cache
December 20, 2023
Authors: Zhaoyang Zhang, Wenqi Shao, Yixiao Ge, Xiaogang Wang, Jinwei Gu, Ping Luo
cs.AI
Abstract
This work introduces a new Transformer model called Cached Transformer, which
uses Gated Recurrent Cached (GRC) attention to extend the self-attention
mechanism with a differentiable memory cache of tokens. GRC attention enables
attending to both past and current tokens, increasing the receptive field of
attention and allowing it to explore long-range dependencies. By utilizing a
recurrent gating unit to continuously update the cache, our model achieves
significant advancements in six language and vision tasks, including
language modeling, machine translation, ListOPs, image classification, object
detection, and instance segmentation. Furthermore, our approach surpasses
previous memory-based techniques on tasks such as language modeling and can be
applied to a broader range of settings.
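GRC attention is only described at a high level here, so below is a minimal Python (PyTorch) sketch of what such a cached attention layer could look like, based solely on this abstract: the current tokens attend both to themselves and to a fixed-length memory cache, the two outputs are mixed by a learnable sigmoid ratio, and the cache is updated by a gated interpolation with a summary of the current tokens, keeping the whole update differentiable. The class name GRCAttention, the parameters cache_len, gate, and mix, and the exact update equations are assumptions for illustration and are not taken from the paper.

from typing import Optional

import torch
import torch.nn as nn


class GRCAttention(nn.Module):
    """Sketch of cached attention: self-attention over current tokens plus
    cross-attention to a differentiable memory cache updated by a gate."""

    def __init__(self, dim: int, num_heads: int = 8, cache_len: int = 64):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cache_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate deciding how much of the new token summary overwrites the cache.
        self.gate = nn.Linear(2 * dim, dim)
        # Learnable scalar mixing cache attention with ordinary self-attention.
        self.mix = nn.Parameter(torch.zeros(1))
        self.cache_len = cache_len
        self.dim = dim

    def forward(self, x: torch.Tensor, cache: Optional[torch.Tensor] = None):
        # x: (batch, seq_len, dim); cache: (batch, cache_len, dim) or None.
        if cache is None:
            cache = x.new_zeros(x.size(0), self.cache_len, self.dim)

        # Attend to current tokens and to cached (past) tokens.
        out_self, _ = self.self_attn(x, x, x)
        out_cache, _ = self.cache_attn(x, cache, cache)

        # Combine the two attention streams with a learnable sigmoid ratio.
        lam = torch.sigmoid(self.mix)
        out = lam * out_cache + (1.0 - lam) * out_self

        # Summarize the current tokens into cache_len slots (cache queries
        # attend to x), then update the cache with a recurrent gate.
        summary, _ = self.cache_attn(cache, x, x)
        g = torch.sigmoid(self.gate(torch.cat([cache, summary], dim=-1)))
        new_cache = (1.0 - g) * cache + g * summary
        return out, new_cache


# Usage: the cache returned by one call is fed into the next.
layer = GRCAttention(dim=256)
tokens = torch.randn(2, 128, 256)
out, cache = layer(tokens)           # first call starts from an empty cache
out, cache = layer(tokens, cache)    # later calls reuse and update the cache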