RWKV: Reinventing RNNs for the Transformer Era
May 22, 2023
Authors: Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartlomiej Koptyra, Hayden Lau, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Xiangru Tang, Bolun Wang, Johan S. Wind, Stanislaw Wozniak, Ruichong Zhang, Zhenyuan Zhang, Qihang Zhao, Peng Zhou, Jian Zhu, Rui-Jie Zhu
cs.AI
Abstract
Transformers have revolutionized almost all natural language processing (NLP)
tasks but suffer from memory and computational complexity that scales
quadratically with sequence length. In contrast, recurrent neural networks
(RNNs) exhibit linear scaling in memory and computational requirements but
struggle to match the performance of Transformers due to limitations in
parallelization and scalability. We propose a novel model architecture,
Receptance Weighted Key Value (RWKV), that combines the efficient
parallelizable training of Transformers with the efficient inference of RNNs.
Our approach leverages a linear attention mechanism and allows us to formulate
the model as either a Transformer or an RNN, which parallelizes computations
during training and maintains constant computational and memory complexity
during inference, leading to the first non-transformer architecture to be
scaled to tens of billions of parameters. Our experiments reveal that RWKV
performs on par with similarly sized Transformers, suggesting that future work
can leverage this architecture to create more efficient models. This work
presents a significant step towards reconciling the trade-offs between
computational efficiency and model performance in sequence processing tasks.
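To make the "Transformer-or-RNN" dual formulation concrete, below is a minimal sketch (not the authors' implementation) of the WKV linear-attention recurrence that underlies RWKV's time mixing. The function name `wkv_recurrent` and the inputs `k`, `v` (key/value sequences), `w` (per-channel decay), and `u` (bonus for the current token) are illustrative assumptions; the real kernels additionally use token-shift mixing and a running-maximum trick for numerical stability, which are omitted here for clarity.

```python
# A minimal sketch of the RWKV "WKV" recurrence, assuming precomputed
# key/value sequences k, v of shape (T, C), per-channel decay w, and
# per-channel bonus u. Hypothetical names; numerical stabilization omitted.
import numpy as np

def wkv_recurrent(k: np.ndarray, v: np.ndarray, w: np.ndarray, u: np.ndarray) -> np.ndarray:
    """Compute wkv_t step by step with O(1) state per channel:

    wkv_t = (sum_{i<t} e^{-(t-1-i)w + k_i} v_i + e^{u + k_t} v_t)
            / (sum_{i<t} e^{-(t-1-i)w + k_i}      + e^{u + k_t})
    """
    T, C = k.shape
    out = np.zeros((T, C))
    a = np.zeros(C)  # running weighted sum of values (numerator state)
    b = np.zeros(C)  # running sum of weights (denominator state)
    for t in range(T):
        bonus = np.exp(u + k[t])
        # current token contributes with the extra bonus u before entering the state
        out[t] = (a + bonus * v[t]) / (b + bonus)
        # decay the past state by e^{-w}, then absorb the current token
        a = np.exp(-w) * a + np.exp(k[t]) * v[t]
        b = np.exp(-w) * b + np.exp(k[t])
    return out

# Toy usage: 8 timesteps, 4 channels.
rng = np.random.default_rng(0)
k, v = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
w, u = np.full(4, 0.5), np.zeros(4)
print(wkv_recurrent(k, v, w, u).shape)  # (8, 4)
```

The same quantity can be computed for all timesteps at once as an attention-like weighted average during training, while the loop above shows why inference needs only the fixed-size state `(a, b)` per channel, i.e. constant memory and compute per generated token.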