Transformers are Multi-State RNNs
January 11, 2024
Authors: Matanel Oren, Michael Hassid, Yossi Adi, Roy Schwartz
cs.AI
Abstract
Transformers are considered conceptually different from the previous
generation of state-of-the-art NLP models - recurrent neural networks (RNNs).
In this work, we demonstrate that decoder-only transformers can in fact be
conceptualized as infinite multi-state RNNs - an RNN variant with unlimited
hidden state size. We further show that pretrained transformers can be
converted into finite multi-state RNNs by fixing the size of their
hidden state. We observe that several existing transformer cache compression
techniques can be framed as such conversion policies, and introduce a novel
policy, TOVA, which is simpler than these policies. Our experiments with
several long-range tasks indicate that TOVA outperforms all other baseline
policies, while being nearly on par with the full (infinite) model, and using
in some cases only 1/8 of the original cache size. Our results
indicate that transformer decoder LLMs often behave in practice as RNNs. They
also lay out the option of mitigating one of their most painful computational
bottlenecks - the size of their cache memory. We publicly release our code at
https://github.com/schwartz-lab-NLP/TOVA.
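
The abstract only sketches TOVA at a high level, so the snippet below is a minimal, illustrative sketch of a TOVA-style cache-compression step rather than the paper's released implementation. It assumes that the current token's attention scores over the cached positions are averaged across heads and that only the highest-scoring positions are kept, up to a fixed multi-state (KV-cache) size; the function name tova_keep_indices and the use of PyTorch are assumptions for this example.

```python
import torch


def tova_keep_indices(attn_weights: torch.Tensor, cache_limit: int) -> torch.Tensor:
    """Return indices of cached tokens to keep under a TOVA-style policy (sketch).

    attn_weights: attention of the current query over all cached positions,
        shape (num_heads, seq_len). Assumption: scores are averaged over heads.
    cache_limit: fixed size of the multi-state (maximum cached tokens).
    """
    # Average attention across heads, then keep the cache_limit
    # highest-scoring cached positions; the rest are evicted.
    scores = attn_weights.mean(dim=0)              # (seq_len,)
    if scores.numel() <= cache_limit:
        return torch.arange(scores.numel())
    keep = torch.topk(scores, k=cache_limit).indices
    return torch.sort(keep).values                 # keep original token order


if __name__ == "__main__":
    # Toy example: 4 heads attending over 10 cached tokens, cache capped at 4.
    torch.manual_seed(0)
    attn = torch.softmax(torch.randn(4, 10), dim=-1)
    print(tova_keep_indices(attn, cache_limit=4))
```

In a decoding loop this selection would be applied per layer after each step, so the cache never grows beyond cache_limit tokens; see the repository linked above for the authors' actual implementation.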