Retentive Network: A Successor to Transformer for Large Language Models

Abstract

In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. We then propose the retention mechanism for sequence modeling, which supports three computation paradigms: parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost O(1) inference, which improves decoding throughput, latency, and GPU memory without sacrificing performance. The chunkwise recurrent representation facilitates efficient long-sequence modeling with linear complexity: each chunk is encoded in parallel, while the chunks are summarized recurrently. Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference. These properties make RetNet a strong successor to Transformer for large language models. Code will be available at https://aka.ms/retnet.
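
The three computation paradigms named in the abstract can be illustrated with a minimal sketch. It assumes a simplified single-head retention with one scalar decay `gamma` and omits the projections, position encoding, normalization, and gating that a full model would use. The function names, the decay value, and the chunk size are illustrative choices, not taken from the abstract. The parallel form is written as explicit loops for readability; in practice it would be a masked matrix product.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retention_parallel(Q, K, V, gamma):
    """Parallel form: out[n] = sum over m <= n of gamma^(n-m) * (Q[n].K[m]) * V[m]."""
    out = []
    for n in range(len(Q)):
        row = [0.0] * len(V[0])
        for m in range(n + 1):  # causal: only positions m <= n contribute
            w = gamma ** (n - m) * dot(Q[n], K[m])
            row = [r + w * v for r, v in zip(row, V[m])]
        out.append(row)
    return out


def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: a fixed-size state S is updated once per token."""
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]  # state size does not depend on sequence length
    out = []
    for q, k, v in zip(Q, K, V):
        # S <- gamma * S + outer(k, v)
        S = [[gamma * S[i][j] + k[i] * v[j] for j in range(dv)] for i in range(d)]
        # out = q @ S
        out.append([sum(q[i] * S[i][j] for i in range(d)) for j in range(dv)])
    return out


def retention_chunkwise(Q, K, V, gamma, chunk):
    """Chunkwise form: parallel inside each chunk, recurrent state R across chunks."""
    d, dv = len(K[0]), len(V[0])
    R = [[0.0] * dv for _ in range(d)]  # summary of all previous chunks
    out = []
    for start in range(0, len(Q), chunk):
        q = Q[start:start + chunk]
        k = K[start:start + chunk]
        v = V[start:start + chunk]
        size = len(q)
        inner = retention_parallel(q, k, v, gamma)  # contributions from within the chunk
        for j, row in enumerate(inner):
            # contribution of earlier chunks, decayed by the offset inside this chunk
            cross = [sum(q[j][i] * R[i][c] for i in range(d)) for c in range(dv)]
            out.append([r + gamma ** (j + 1) * x for r, x in zip(row, cross)])
        # carry the state forward to the end of this chunk
        R = [[gamma ** size * R[i][c]
              + sum(gamma ** (size - 1 - j) * k[j][i] * v[j][c] for j in range(size))
              for c in range(dv)] for i in range(d)]
    return out


def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))
```

Under these assumptions the three functions compute the same outputs up to floating-point error, which can be checked by calling `max_abs_diff` on any pair of results. The recurrent variant keeps only the `d x dv` state `S` between tokens, which is the sense in which per-token decoding cost stays constant as the sequence grows.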
