LongNet: Scaling Transformers to 1,000,000,000 Tokens

Abstract

Scaling sequence length has become a critical demand in the era of large language models. However, existing methods struggle with either computational complexity or model expressivity, which restricts the maximum sequence length. In this work, we introduce LongNet, a Transformer variant that can scale sequence length to more than 1 billion tokens without sacrificing performance on shorter sequences. Specifically, we propose dilated attention, which expands the attentive field exponentially as the distance grows. LongNet has significant advantages: 1) it has linear computational complexity and a logarithmic dependency between tokens; 2) it can serve as a distributed trainer for extremely long sequences; 3) its dilated attention is a drop-in replacement for standard attention and can be seamlessly integrated with existing Transformer-based optimizations. Experimental results demonstrate that LongNet yields strong performance on both long-sequence modeling and general language tasks. Our work opens up new possibilities for modeling very long sequences, e.g., treating a whole corpus or even the entire Internet as a sequence.
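To make the linear-complexity claim concrete, here is a minimal sketch of how a dilated sparsity pattern could keep the number of attended pairs proportional to sequence length. The abstract does not spell out the mechanism, so everything below is an assumption for illustration: the function names, the idea of splitting the sequence into segments and keeping every r-th position, and the geometric schedule in which segment length and dilation both double at each level. It counts query-key pairs only and is not the authors' implementation.

```python
def dilated_indices(seq_len, segment_len, dilation):
    """Split positions 0..seq_len-1 into segments and keep every
    `dilation`-th position inside each segment (hypothetical helper)."""
    segments = []
    for start in range(0, seq_len, segment_len):
        end = min(start + segment_len, seq_len)
        segments.append(list(range(start, end, dilation)))
    return segments


def attention_pairs(seq_len, segment_len, dilation):
    """Count query-key pairs if each segment attends densely
    over its kept positions only."""
    segments = dilated_indices(seq_len, segment_len, dilation)
    return sum(len(seg) ** 2 for seg in segments)


def total_pairs(seq_len, base_segment=4):
    """Sum pairs over an assumed geometric schedule of
    (segment_len, dilation) = (base_segment * 2^i, 2^i)."""
    total = 0
    segment_len, dilation = base_segment, 1
    while segment_len <= seq_len:
        total += attention_pairs(seq_len, segment_len, dilation)
        segment_len *= 2
        dilation *= 2
    return total


if __name__ == "__main__":
    for n in (16, 32, 1024):
        dense = n * n
        sparse = total_pairs(n)
        print(f"n={n}: dense pairs={dense}, dilated pairs={sparse}")
```

Under this assumed schedule each level keeps the same number of positions per segment while the number of segments halves, so the per-level cost shrinks geometrically and the total stays below 2 * base_segment * seq_len, whereas dense attention grows as seq_len squared. The widest level spans the whole sequence, which is one way to read the abstract's statement that the attentive field expands exponentially with distance.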
