Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length

Abstract

The quadratic complexity and weak length extrapolation of Transformers limit their ability to scale to long sequences. Sub-quadratic solutions such as linear attention and state space models exist, but they empirically underperform Transformers in pretraining efficiency and downstream task accuracy. We introduce Megalodon, a neural architecture for efficient sequence modeling with unlimited context length. Megalodon inherits the architecture of Mega (exponential moving average with gated attention) and introduces several technical components to improve its capability and stability: complex exponential moving average (CEMA), a timestep normalization layer, a normalized attention mechanism, and pre-norm with a two-hop residual configuration. In a controlled head-to-head comparison with Llama2, Megalodon achieves better efficiency than the Transformer at the scale of 7 billion parameters and 2 trillion training tokens. Megalodon reaches a training loss of 1.70, landing midway between Llama2-7B (1.75) and Llama2-13B (1.67). Code: https://github.com/XuezheMax/megalodon
