

Taipan: Efficient and Expressive State Space Language Models with Selective Attention

October 24, 2024
作者: Chien Van Nguyen, Huy Huu Nguyen, Thang M. Pham, Ruiyi Zhang, Hanieh Deilamsalehy, Puneet Mathur, Ryan A. Rossi, Trung Bui, Viet Dac Lai, Franck Dernoncourt, Thien Huu Nguyen
cs.AI

Abstract

Efficient long-context language modeling remains a significant challenge in Natural Language Processing (NLP). While Transformers dominate language tasks, they struggle with long sequences due to quadratic computational complexity in training and linearly scaling memory costs during inference. Recent State Space Models (SSMs) such as Mamba offer alternatives with constant memory usage, but they underperform in tasks requiring extensive in-context retrieval. We introduce Taipan, a novel hybrid architecture that combines Mamba-2 with Selective Attention Layers (SALs). These SALs identify tokens requiring long-range interactions, remove less important features, and then augment their representations using the attention module. This approach balances Mamba's efficiency with Transformer-like performance in memory-intensive tasks. By constraining the attention budget, Taipan extends accurate predictions to context lengths of up to 1 million tokens while preserving computational efficiency. Our experiments demonstrate Taipan's superior performance across various scales and tasks, offering a promising solution for efficient long-context language modeling.
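To make the described mechanism concrete, below is a minimal, hypothetical sketch of a Selective Attention Layer (SAL) as characterized in the abstract: a scoring network identifies the tokens that need long-range interactions under a constrained attention budget, an attention module refines only those tokens, and the remaining tokens pass through unchanged. All class, parameter, and variable names are illustrative assumptions, not taken from the paper or its released code, and the details (e.g., how selected tokens attend, how features are removed) may differ from Taipan's actual design.

```python
# Hypothetical SAL sketch: score tokens, attend only to the top-k under a
# fixed attention budget, and scatter the refined representations back.
import torch
import torch.nn as nn


class SelectiveAttentionLayer(nn.Module):
    def __init__(self, d_model: int, n_heads: int, budget_ratio: float = 0.1):
        super().__init__()
        self.scorer = nn.Linear(d_model, 1)            # importance score per token
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.budget_ratio = budget_ratio               # fraction of tokens attended

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model), e.g. hidden states from a Mamba-2 block
        B, L, D = x.shape
        k = max(1, int(L * self.budget_ratio))         # constrained attention budget

        scores = self.scorer(x).squeeze(-1)            # (B, L) token importance
        top_idx = scores.topk(k, dim=-1).indices       # tokens needing long-range info

        # Gather selected tokens and refine them with the attention module
        idx = top_idx.unsqueeze(-1).expand(-1, -1, D)
        selected = x.gather(1, idx)                    # (B, k, D)
        refined, _ = self.attn(selected, selected, selected)

        # Write refined representations back; unselected tokens stay unchanged
        out = x.clone()
        out.scatter_(1, idx, refined)
        return out


if __name__ == "__main__":
    layer = SelectiveAttentionLayer(d_model=64, n_heads=4, budget_ratio=0.1)
    hidden = torch.randn(2, 128, 64)                   # stand-in for Mamba-2 outputs
    print(layer(hidden).shape)                         # torch.Size([2, 128, 64])
```

In this reading, the budget_ratio cap is what keeps the attention cost sublinear in sequence length while the Mamba-2 backbone handles the rest of the tokens with constant memory.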