

Transformers are Multi-State RNNs

January 11, 2024
Authors: Matanel Oren, Michael Hassid, Yossi Adi, Roy Schwartz
cs.AI

Abstract

Transformers are considered conceptually different from the previous generation of state-of-the-art NLP models - recurrent neural networks (RNNs). In this work, we demonstrate that decoder-only transformers can in fact be conceptualized as infinite multi-state RNNs - an RNN variant with unlimited hidden state size. We further show that pretrained transformers can be converted into finite multi-state RNNs by fixing the size of their hidden state. We observe that several existing transformer cache compression techniques can be framed as such conversion policies, and introduce a novel policy, TOVA, which is simpler than these policies. Our experiments on several long-range tasks indicate that TOVA outperforms all other baseline policies while being nearly on par with the full (infinite) model, in some cases using only 1/8 of the original cache size. Our results indicate that transformer decoder LLMs often behave in practice as RNNs. They also lay out the option of mitigating one of their most painful computational bottlenecks - the size of their cache memory. We publicly release our code at https://github.com/schwartz-lab-NLP/TOVA.
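To make the conversion described in the abstract concrete, below is a minimal sketch of what one TOVA-style eviction step could look like for a single attention layer: once the key/value cache exceeds a fixed budget, the cached position that receives the least attention from the current query (averaged over heads) is dropped. The function name `tova_evict`, the tensor layout, and the head-averaging detail are illustrative assumptions rather than the authors' API; the released code at https://github.com/schwartz-lab-NLP/TOVA is the authoritative implementation.

```python
import torch

def tova_evict(k_cache, v_cache, attn_weights, budget):
    """One hypothetical TOVA-style eviction step (illustrative sketch).

    k_cache, v_cache: (num_heads, cache_len, head_dim) cached keys/values.
    attn_weights:     (num_heads, cache_len) post-softmax attention of the
                      current query over every cached position.
    budget:           maximum number of cached tokens to keep.
    """
    cache_len = k_cache.shape[1]
    if cache_len <= budget:
        return k_cache, v_cache

    # Average the current token's attention over heads and drop the
    # least-attended cached position.
    scores = attn_weights.mean(dim=0)            # (cache_len,)
    drop = torch.argmin(scores).item()

    keep = torch.ones(cache_len, dtype=torch.bool)
    keep[drop] = False
    return k_cache[:, keep, :], v_cache[:, keep, :]
```

Applying such a step at every decoding step keeps the cache at a fixed size, which is the sense in which the paper views the resulting model as a finite multi-state RNN.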