RWKV: Reinventing RNNs for the Transformer Era
May 22, 2023
Authors: Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartlomiej Koptyra, Hayden Lau, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Xiangru Tang, Bolun Wang, Johan S. Wind, Stanislaw Wozniak, Ruichong Zhang, Zhenyuan Zhang, Qihang Zhao, Peng Zhou, Jian Zhu, Rui-Jie Zhu
cs.AI
Abstract
Transformers have revolutionized almost all natural language processing (NLP)
tasks but suffer from memory and computational complexity that scales
quadratically with sequence length. In contrast, recurrent neural networks
(RNNs) exhibit linear scaling in memory and computational requirements but
struggle to match the performance of Transformers due to limitations in
parallelization and scalability. We propose a novel model architecture,
Receptance Weighted Key Value (RWKV), that combines the efficient
parallelizable training of Transformers with the efficient inference of RNNs.
Our approach leverages a linear attention mechanism and allows us to formulate
the model as either a Transformer or an RNN, which parallelizes computations
during training and maintains constant computational and memory complexity
during inference, leading to the first non-transformer architecture to be
scaled to tens of billions of parameters. Our experiments reveal that RWKV
performs on par with similarly sized Transformers, suggesting that future work
can leverage this architecture to create more efficient models. This work
presents a significant step towards reconciling the trade-offs between
computational efficiency and model performance in sequence processing tasks.
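To make the "Transformer-or-RNN" dual formulation concrete, below is a minimal sketch (not the authors' implementation) of the WKV linear-attention recurrence that underlies RWKV's time mixing. The function name `wkv_recurrent` and the inputs `k`, `v` (key/value sequences), `w` (per-channel decay), and `u` (bonus for the current token) are illustrative assumptions; the real kernels additionally use token-shift mixing and a running-maximum trick for numerical stability, which are omitted here for clarity.

```python
# A minimal sketch of the RWKV "WKV" recurrence, assuming precomputed
# key/value sequences k, v of shape (T, C), per-channel decay w, and
# per-channel bonus u. Hypothetical names; numerical stabilization omitted.
import numpy as np

def wkv_recurrent(k: np.ndarray, v: np.ndarray, w: np.ndarray, u: np.ndarray) -> np.ndarray:
    """Compute wkv_t step by step with O(1) state per channel:

    wkv_t = (sum_{i<t} e^{-(t-1-i)w + k_i} v_i + e^{u + k_t} v_t)
            / (sum_{i<t} e^{-(t-1-i)w + k_i}      + e^{u + k_t})
    """
    T, C = k.shape
    out = np.zeros((T, C))
    a = np.zeros(C)  # running weighted sum of values (numerator state)
    b = np.zeros(C)  # running sum of weights (denominator state)
    for t in range(T):
        bonus = np.exp(u + k[t])
        # current token contributes with the extra bonus u before entering the state
        out[t] = (a + bonus * v[t]) / (b + bonus)
        # decay the past state by e^{-w}, then absorb the current token
        a = np.exp(-w) * a + np.exp(k[t]) * v[t]
        b = np.exp(-w) * b + np.exp(k[t])
    return out

# Toy usage: 8 timesteps, 4 channels.
rng = np.random.default_rng(0)
k, v = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
w, u = np.full(4, 0.5), np.zeros(4)
print(wkv_recurrent(k, v, w, u).shape)  # (8, 4)
```

The same quantity can be computed for all timesteps at once as an attention-like weighted average during training, while the loop above shows why inference needs only the fixed-size state `(a, b)` per channel, i.e. constant memory and compute per generated token.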