Transformers are Multi-State RNNs
January 11, 2024
Authors: Matanel Oren, Michael Hassid, Yossi Adi, Roy Schwartz
cs.AI
Abstract
Transformers are considered conceptually different from the previous
generation of state-of-the-art NLP models - recurrent neural networks (RNNs).
In this work, we demonstrate that decoder-only transformers can in fact be
conceptualized as infinite multi-state RNNs - an RNN variant with unlimited
hidden state size. We further show that pretrained transformers can be
converted into finite multi-state RNNs by fixing the size of their
hidden state. We observe that several existing transformer cache compression
techniques can be framed as such conversion policies, and introduce a novel
policy, TOVA, which is simpler than these policies. Our experiments with
several long-range tasks indicate that TOVA outperforms all other baseline
policies, while being nearly on par with the full (infinite) model, and using
in some cases only 1/8 of the original cache size. Our results
indicate that transformer decoder LLMs often behave in practice as RNNs. They
also lay out the option of mitigating one of their most painful computational
bottlenecks - the size of their cache memory. We publicly release our code at
https://github.com/schwartz-lab-NLP/TOVA.
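
The abstract only sketches TOVA at a high level, so the snippet below is a minimal, illustrative sketch of a TOVA-style cache-compression step rather than the paper's released implementation. It assumes that the current token's attention scores over the cached positions are averaged across heads and that only the highest-scoring positions are kept, up to a fixed multi-state (KV-cache) size; the function name tova_keep_indices and the use of PyTorch are assumptions for this example.

```python
import torch


def tova_keep_indices(attn_weights: torch.Tensor, cache_limit: int) -> torch.Tensor:
    """Return indices of cached tokens to keep under a TOVA-style policy (sketch).

    attn_weights: attention of the current query over all cached positions,
        shape (num_heads, seq_len). Assumption: scores are averaged over heads.
    cache_limit: fixed size of the multi-state (maximum cached tokens).
    """
    # Average attention across heads, then keep the cache_limit
    # highest-scoring cached positions; the rest are evicted.
    scores = attn_weights.mean(dim=0)              # (seq_len,)
    if scores.numel() <= cache_limit:
        return torch.arange(scores.numel())
    keep = torch.topk(scores, k=cache_limit).indices
    return torch.sort(keep).values                 # keep original token order


if __name__ == "__main__":
    # Toy example: 4 heads attending over 10 cached tokens, cache capped at 4.
    torch.manual_seed(0)
    attn = torch.softmax(torch.randn(4, 10), dim=-1)
    print(tova_keep_indices(attn, cache_limit=4))
```

In a decoding loop this selection would be applied per layer after each step, so the cache never grows beyond cache_limit tokens; see the repository linked above for the authors' actual implementation.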