Zebra: Extending Context Window with Layerwise Grouped Local-Global Attention

Abstract

This paper introduces a new approach to improving the ability of Large Language Models (LLMs) to process and understand long text sequences, a critical capability in applications that require deep comprehension and synthesis of large volumes of information. Recognizing the challenges inherent in extending the context window of LLMs, which are primarily built on the Transformer architecture, we propose a new model architecture called Zebra. By employing grouped local-global attention layers, this architecture efficiently manages the quadratic time and memory complexity of full attention in the Transformer. Like a zebra's alternating stripes, our model balances local and global attention layers, significantly reducing computational requirements and memory consumption. To evaluate Zebra's performance, we conduct comprehensive experiments that include pretraining from scratch, continued long-context adaptation training, and long instruction tuning. The results show that Zebra achieves comparable or superior performance on both short- and long-sequence benchmarks while also improving training and inference efficiency.
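
To make the idea of alternating local and global attention layers concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not state the group size, the local window length, or which layer within a group is global, so `group_size`, `window`, and the choice of making the first layer of each group global are all assumptions made purely for illustration. The sketch counts causal query-key pairs per layer to show why local layers cost less than full attention.

```python
def layer_schedule(num_layers: int, group_size: int) -> list[str]:
    """Label each layer "global" or "local".

    Assumption (not from the source): the first layer of every group
    uses global (full) attention and the remaining layers are local.
    """
    return ["global" if i % group_size == 0 else "local" for i in range(num_layers)]


def attended_pairs(seq_len: int, kind: str, window: int) -> int:
    """Count the causal query-key pairs one layer computes."""
    if kind == "global":
        # Full causal attention: query i sees keys 0..i, which grows quadratically.
        return seq_len * (seq_len + 1) // 2
    # Local causal attention: query i sees at most the last `window` keys,
    # so the count grows linearly in seq_len once seq_len exceeds window.
    return sum(min(i + 1, window) for i in range(seq_len))


def total_pairs(seq_len: int, num_layers: int, group_size: int, window: int) -> int:
    """Sum attended pairs over all layers of the assumed schedule."""
    return sum(
        attended_pairs(seq_len, kind, window)
        for kind in layer_schedule(num_layers, group_size)
    )


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    seq_len, num_layers, window = 4096, 32, 512
    full = total_pairs(seq_len, num_layers, group_size=1, window=window)
    grouped = total_pairs(seq_len, num_layers, group_size=4, window=window)
    print(f"all-global pairs: {full}")
    print(f"grouped local-global pairs: {grouped}")
    print(f"ratio: {grouped / full:.3f}")
```

Setting `group_size=1` makes every layer global, which recovers the full-attention baseline, so the same function can be used to compare the two configurations under identical assumed sizes.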
