Zebra: Extending Context Window with Layerwise Grouped Local-Global Attention
December 14, 2023
Authors: Kaiqiang Song, Xiaoyang Wang, Sangwoo Cho, Xiaoman Pan, Dong Yu
cs.AI
Abstract
This paper introduces a novel approach to enhance the capabilities of Large
Language Models (LLMs) in processing and understanding extensive text
sequences, a critical aspect in applications requiring deep comprehension and
synthesis of large volumes of information. Recognizing the inherent challenges
in extending the context window for LLMs, primarily built on Transformer
architecture, we propose a new model architecture, referred to as Zebra. This
architecture efficiently manages the quadratic time and memory complexity
issues associated with full attention in the Transformer by employing grouped
local-global attention layers. Our model, akin to a zebra's alternating
stripes, balances local and global attention layers, significantly reducing
computational requirements and memory consumption. Comprehensive experiments,
including pretraining from scratch, continuation of long context adaptation
training, and long instruction tuning, are conducted to evaluate Zebra's
performance. The results show that Zebra achieves comparable or superior
performance on both short and long sequence benchmarks, while also enhancing
training and inference efficiency.
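As a rough illustration of the layerwise grouped local-global idea described in the abstract, the sketch below assigns one global-attention layer to each group of layers, uses local (sliding-window) attention for the rest, and builds the corresponding causal masks. This is not the authors' implementation; the grouping ratio and the names `group_size` and `window` are illustrative assumptions.

```python
# Illustrative sketch (not the paper's code): a layerwise grouped
# local-global attention pattern and the masks each layer type uses.

def layer_attention_types(num_layers: int, group_size: int) -> list[str]:
    """Assign one global-attention layer per group of `group_size` layers;
    the remaining layers in the group use local (windowed) attention."""
    types = []
    for layer_idx in range(num_layers):
        if layer_idx % group_size == 0:
            types.append("global")   # full attention over the whole sequence
        else:
            types.append("local")    # attention restricted to a nearby window
    return types


def attention_mask(seq_len: int, kind: str, window: int = 4) -> list[list[bool]]:
    """Causal mask: True where query position i may attend to key position j.
    A 'global' layer sees all previous tokens; a 'local' layer only the last
    `window` tokens, which avoids the quadratic growth in cost with length."""
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            if j > i:                      # causal: never attend to the future
                row.append(False)
            elif kind == "global":
                row.append(True)
            else:                          # local: sliding window of size `window`
                row.append(i - j < window)
        mask.append(row)
    return mask


if __name__ == "__main__":
    print(layer_attention_types(num_layers=8, group_size=4))
    # ['global', 'local', 'local', 'local', 'global', 'local', 'local', 'local']
    for row in attention_mask(seq_len=6, kind="local", window=3):
        print(["x" if v else "." for v in row])
```

In this toy setup, the alternating pattern of global and local layers mirrors the "zebra stripes" analogy: only a fraction of layers pay the full quadratic attention cost, while the windowed layers keep per-token work roughly constant as the context grows.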