RWKV: Reinventing RNNs for the Transformer Era

Abstract

Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the performance of Transformers due to limitations in parallelization and scalability. We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs. Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transformer or an RNN. This parallelizes computations during training and maintains constant computational and memory complexity during inference, leading to the first non-Transformer architecture to be scaled to tens of billions of parameters. Our experiments reveal that RWKV performs on par with similarly sized Transformers, suggesting that future work can leverage this architecture to create more efficient models. This work presents a significant step towards reconciling the trade-offs between computational efficiency and model performance in sequence processing tasks.
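The idea that one model can be written either in a parallel-friendly form or as an RNN can be illustrated with a small sketch. The code below is not the RWKV formulation itself; it assumes a simplified, single-channel, decay-weighted average in the spirit of linear attention. The function names, the scalar `decay` parameter, and the exact weighting are assumptions made for illustration only. It shows that computing each output from the full history and computing it from a constant-size running state give the same result.

```python
import math


def parallel_form(keys, values, decay):
    """Compute every output directly from the inputs seen so far.

    Position t depends only on the inputs, never on earlier outputs,
    so all positions could be computed independently (O(T^2) total work).
    """
    outputs = []
    for t in range(len(keys)):
        num = 0.0
        den = 0.0
        for i in range(t + 1):
            # Older positions are down-weighted exponentially (assumed form).
            weight = math.exp(-decay * (t - i) + keys[i])
            num += weight * values[i]
            den += weight
        outputs.append(num / den)
    return outputs


def recurrent_form(keys, values, decay):
    """Compute the same outputs with a fixed-size state (num, den).

    Each step costs constant time and memory, regardless of how many
    tokens came before.
    """
    shrink = math.exp(-decay)
    num = 0.0
    den = 0.0
    outputs = []
    for k, v in zip(keys, values):
        # Decay the running sums, then add the current token's contribution.
        num = shrink * num + math.exp(k) * v
        den = shrink * den + math.exp(k)
        outputs.append(num / den)
    return outputs


if __name__ == "__main__":
    keys = [0.1, -0.3, 0.5, 0.2]
    values = [1.0, 2.0, -1.0, 0.5]
    print(parallel_form(keys, values, decay=0.7))
    print(recurrent_form(keys, values, decay=0.7))
```

The two printed lists should agree up to floating-point rounding. The first form suits training, where the whole sequence is available at once; the second suits inference, where tokens arrive one at a time and only the two running sums need to be kept.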
