Retentive Network: A Successor to Transformer for Large Language Models
July 17, 2023
Authors: Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, Furu Wei
cs.AI
Abstract
In this work, we propose Retentive Network (RetNet) as a foundation
architecture for large language models, simultaneously achieving training
parallelism, low-cost inference, and good performance. We theoretically derive
the connection between recurrence and attention. Then we propose the retention
mechanism for sequence modeling, which supports three computation paradigms,
i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel
representation allows for training parallelism. The recurrent representation
enables low-cost O(1) inference, which improves decoding throughput, latency,
and GPU memory usage without sacrificing performance. The chunkwise recurrent
representation facilitates efficient long-sequence modeling with linear
complexity, where each chunk is encoded in parallel while recurrently
summarizing the chunks. Experimental results on language modeling show that
RetNet achieves favorable scaling results, parallel training, low-cost
deployment, and efficient inference. These intriguing properties make RetNet a
strong successor to Transformer for large language models. Code will be
available at https://aka.ms/retnet.
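
To illustrate how the three computation paradigms described in the abstract can produce the same output, below is a minimal NumPy sketch of a single retention head. It assumes a scalar decay gamma, random projected Q/K/V, and illustrative sizes, and it omits the paper's multi-scale (per-head) decay, rotation, and normalization; the function names and parameter values are assumptions for illustration, not the RetNet reference implementation.

```python
# Minimal sketch of retention in its parallel, recurrent, and chunkwise forms.
# Assumed setup: one head, scalar decay gamma, random Q/K/V standing in for
# the learned projections; multi-scale decay, rotation, and normalization from
# the paper are intentionally omitted.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 8, 4          # sequence length and head dimension (assumed)
gamma = 0.9                # decay value (assumed)
chunk_size = 4             # chunk size for the chunkwise form (assumed)

Q = rng.standard_normal((seq_len, d))
K = rng.standard_normal((seq_len, d))
V = rng.standard_normal((seq_len, d))

def retention_parallel(Q, K, V, gamma):
    """Parallel form: (Q K^T * D) V with causal decay D[n, m] = gamma^(n-m) for n >= m."""
    n = np.arange(len(Q))
    D = np.where(n[:, None] >= n[None, :],
                 gamma ** (n[:, None] - n[None, :]), 0.0)
    return (Q @ K.T * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + K_n^T V_n, output O_n = Q_n S_n (O(1) state)."""
    S = np.zeros((Q.shape[1], V.shape[1]))
    out = np.empty_like(V)
    for t in range(len(Q)):
        S = gamma * S + np.outer(K[t], V[t])
        out[t] = Q[t] @ S
    return out

def retention_chunkwise(Q, K, V, gamma, chunk_size):
    """Chunkwise form: parallel inside each chunk, recurrent state carried across chunks."""
    S = np.zeros((Q.shape[1], V.shape[1]))
    outs = []
    for s in range(0, len(Q), chunk_size):
        q, k, v = Q[s:s+chunk_size], K[s:s+chunk_size], V[s:s+chunk_size]
        b = len(q)
        i = np.arange(b)
        D = np.where(i[:, None] >= i[None, :],
                     gamma ** (i[:, None] - i[None, :]), 0.0)
        inner = (q @ k.T * D) @ v                      # within-chunk, computed in parallel
        cross = (q * gamma ** (i[:, None] + 1)) @ S    # contribution of all previous chunks
        outs.append(inner + cross)
        # Roll the cross-chunk state forward with position-dependent decay.
        S = gamma ** b * S + (k * gamma ** (b - 1 - i)[:, None]).T @ v
    return np.concatenate(outs)

p = retention_parallel(Q, K, V, gamma)
r = retention_recurrent(Q, K, V, gamma)
c = retention_chunkwise(Q, K, V, gamma, chunk_size)
print(np.allclose(p, r), np.allclose(p, c))  # True True: the three forms agree
```

The recurrent function keeps only a d-by-d state per step, which is the O(1) inference cost the abstract refers to, while the chunkwise function processes each block in parallel and carries that same state between blocks, giving linear cost in sequence length.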