Native Hybrid Attention for Efficient Sequence Modeling
October 8, 2025
Authors: Jusen Du, Jiaxi Hu, Tao Zhang, Weigao Sun, Yu Cheng
cs.AI
Abstract
Transformers excel at sequence modeling but face quadratic complexity, while
linear attention offers improved efficiency but often compromises recall
accuracy over long contexts. In this work, we introduce Native Hybrid Attention
(NHA), a novel hybrid architecture of linear and full attention that integrates
both intra- and inter-layer hybridization into a unified layer design. NHA
maintains long-term context in key-value slots updated by a linear RNN, and
augments them with short-term tokens from a sliding window. A single
softmax attention operation is then applied over all keys and values,
enabling per-token and per-head context-dependent weighting without requiring
additional fusion parameters. The inter-layer behavior is controlled through a
single hyperparameter, the sliding window size, which allows smooth adjustment
between purely linear and full attention while keeping all layers structurally
uniform. Experimental results show that NHA surpasses Transformers and other
hybrid baselines on recall-intensive and commonsense reasoning tasks.
Furthermore, pretrained LLMs can be structurally hybridized with NHA, achieving
competitive accuracy while delivering significant efficiency gains. Code is
available at https://github.com/JusenD/NHA.
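
To make the mechanism described above concrete, below is a minimal single-head PyTorch sketch of the attention pattern the abstract outlines: key-value slots maintained by a linear recurrence, a sliding window of recent tokens, and a single softmax over both. The function name `nha_step`, the per-slot decay update, and the `decays`/`num_slots` parameters are illustrative assumptions rather than the paper's actual formulation; see the linked repository for the real implementation.

```python
import torch
import torch.nn.functional as F

def nha_step(q_t, k_t, v_t, slot_k, slot_v, win_k, win_v, decays, window=64):
    """One decoding step for a single head (illustrative sketch).

    q_t, k_t, v_t : (d,) current query / key / value
    slot_k, slot_v: (num_slots, d) long-term key-value slots (linear RNN state)
    win_k, win_v  : (w, d) sliding-window keys / values, w <= window
    decays        : (num_slots,) per-slot decay rates in (0, 1)
    """
    # Linear-RNN style slot update: each slot decays its old content and
    # writes in the new token at its own timescale. (Assumed rule; the
    # paper's recurrence may differ.)
    slot_k = decays[:, None] * slot_k + (1.0 - decays[:, None]) * k_t[None]
    slot_v = decays[:, None] * slot_v + (1.0 - decays[:, None]) * v_t[None]

    # Append the current token to the sliding window, dropping the oldest.
    win_k = torch.cat([win_k, k_t[None]], dim=0)[-window:]
    win_v = torch.cat([win_v, v_t[None]], dim=0)[-window:]

    # One softmax over slots + window tokens: the softmax itself provides
    # per-token, context-dependent weighting of long- vs. short-term memory,
    # so no extra fusion parameters are introduced.
    keys = torch.cat([slot_k, win_k], dim=0)            # (num_slots + w, d)
    vals = torch.cat([slot_v, win_v], dim=0)
    scores = keys @ q_t / q_t.shape[-1] ** 0.5
    weights = F.softmax(scores, dim=0)
    out = weights @ vals                                 # (d,)
    return out, slot_k, slot_v, win_k, win_v

# Toy usage: a small window pushes attention toward the recurrent slots
# (the linear end of the spectrum), while a window as long as the sequence
# approaches full attention.
d, num_slots = 64, 8
decays = torch.linspace(0.5, 0.99, num_slots)
state = (torch.zeros(num_slots, d), torch.zeros(num_slots, d),
         torch.zeros(0, d), torch.zeros(0, d))
out, *state = nha_step(torch.randn(d), torch.randn(d), torch.randn(d),
                       *state, decays, window=64)
```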