Retentive Network: A Successor to Transformer for Large Language Models

Abstract

In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. We then propose the retention mechanism for sequence modeling, which supports three computation paradigms: parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost O(1) inference, which improves decoding throughput, latency, and GPU memory without sacrificing performance. The chunkwise recurrent representation facilitates efficient long-sequence modeling with linear complexity: each chunk is encoded in parallel while the chunks are summarized recurrently. Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference. These properties make RetNet a strong successor to Transformer for large language models. Code will be available at https://aka.ms/retnet.
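
The abstract names the three computation paradigms without giving formulas. As a rough illustration only, the NumPy sketch below assumes a simplified single-head retention with a scalar decay `gamma`, and it omits the position rotation, normalization, and gating of the full model. The function names are hypothetical and are not taken from the released code. It shows that the parallel, recurrent, and chunkwise recurrent forms can compute the same outputs.

```python
import numpy as np


def retention_parallel(q, k, v, gamma):
    """Parallel form: ((Q K^T) * D) V, with D[n, m] = gamma**(n - m) if n >= m else 0."""
    n = q.shape[0]
    idx = np.arange(n)
    diff = idx[:, None] - idx[None, :]
    # Causal decay mask; np.maximum avoids negative exponents in the masked region.
    decay = np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)
    return ((q @ k.T) * decay) @ v


def retention_recurrent(q, k, v, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + k_n^T v_n, o_n = q_n S_n.

    The state has a fixed size (d_k x d_v), so each decoding step costs O(1)
    with respect to sequence length.
    """
    d_k, d_v = k.shape[1], v.shape[1]
    state = np.zeros((d_k, d_v))
    out = np.empty((q.shape[0], d_v))
    for t in range(q.shape[0]):
        state = gamma * state + np.outer(k[t], v[t])
        out[t] = q[t] @ state
    return out


def retention_chunkwise(q, k, v, gamma, chunk):
    """Chunkwise form: parallel inside each chunk, recurrent state across chunks."""
    d_k, d_v = k.shape[1], v.shape[1]
    state = np.zeros((d_k, d_v))
    outs = []
    for start in range(0, q.shape[0], chunk):
        qc = q[start:start + chunk]
        kc = k[start:start + chunk]
        vc = v[start:start + chunk]
        b = qc.shape[0]  # the last chunk may be shorter
        pos = np.arange(b)
        # Contribution of tokens inside the current chunk.
        inner = retention_parallel(qc, kc, vc, gamma)
        # Contribution of all earlier chunks, decayed by gamma**(i + 1) at offset i.
        cross = (qc @ state) * (gamma ** (pos + 1))[:, None]
        outs.append(inner + cross)
        # Carry a summary of this chunk forward in the recurrent state.
        state = (gamma ** b) * state + (kc * (gamma ** (b - 1 - pos))[:, None]).T @ vc
    return np.concatenate(outs, axis=0)
```

Under these assumptions, the parallel form suits training because it is a pair of matrix products, the recurrent form suits decoding because it only keeps a fixed-size state, and the chunkwise form mixes the two for long sequences.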
