
Retentive Network: A Successor to Transformer for Large Language Models

July 17, 2023
Authors: Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, Furu Wei
cs.AI

Abstract

In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. We then propose the retention mechanism for sequence modeling, which supports three computation paradigms: parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost O(1) inference, which improves decoding throughput, latency, and GPU memory usage without sacrificing performance. The chunkwise recurrent representation facilitates efficient long-sequence modeling with linear complexity, where each chunk is encoded in parallel while the chunks are summarized recurrently. Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference. These intriguing properties make RetNet a strong successor to Transformer for large language models. Code will be available at https://aka.ms/retnet.
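The abstract's claim that the parallel and recurrent representations compute the same function can be illustrated with a minimal single-head sketch. This is not the paper's official implementation: the function names, shapes, and scalar decay `gamma` are illustrative assumptions, and the paper's xPos-style position encoding, multi-scale heads, and group normalization are omitted.

```python
# Minimal single-head retention sketch (NumPy only), assuming a scalar decay
# factor gamma. Illustrative only; not the official RetNet implementation.
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Parallel form: (Q K^T * D) V, where D[n, m] = gamma**(n - m) for n >= m, else 0."""
    T = Q.shape[0]
    n, m = np.arange(T)[:, None], np.arange(T)[None, :]
    D = np.where(n >= m, gamma ** (n - m), 0.0)  # causal decay mask
    return (Q @ K.T * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: S_t = gamma * S_{t-1} + k_t^T v_t; output_t = q_t S_t.
    Only a constant-size state S is carried between steps, hence O(1) per-token inference."""
    d_k, d_v = K.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))
    outputs = []
    for q_t, k_t, v_t in zip(Q, K, V):
        S = gamma * S + np.outer(k_t, v_t)  # update the recurrent state
        outputs.append(q_t @ S)
    return np.stack(outputs)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d = 8, 4
    Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
    gamma = 0.9
    # The two computation paradigms produce identical outputs.
    assert np.allclose(retention_parallel(Q, K, V, gamma),
                       retention_recurrent(Q, K, V, gamma))
```

In this sketch the parallel form is what makes training parallelizable (one masked matrix product over the whole sequence), while the recurrent form keeps only a d_k-by-d_v state per step, which is the source of the low-cost inference the abstract describes; the chunkwise recurrent form (not shown) applies the parallel form within each chunk and the recurrence across chunks.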