A Systematic Analysis of Hybrid Linear Attention
July 8, 2025
Authors: Dustin Wang, Rui-Jie Zhu, Steven Abreu, Yong Shan, Taylor Kergan, Yuqi Pan, Yuhong Chou, Zheng Li, Ge Zhang, Wenhao Huang, Jason Eshraghian
cs.AI
Abstract
Transformers face quadratic complexity and memory issues with long sequences,
prompting the adoption of linear attention mechanisms using fixed-size hidden
states. However, linear models often suffer from limited recall performance,
leading to hybrid architectures that combine linear and full attention layers.
Despite extensive hybrid architecture research, the choice of linear attention
component has not been deeply explored. We systematically evaluate linear
attention models across generations, from vector recurrences to advanced
gating mechanisms, both standalone and hybridized. To enable this
comprehensive analysis, we trained and open-sourced 72 models: 36 at 340M
parameters (20B tokens) and 36 at 1.3B parameters (100B tokens), covering six
linear attention variants across five hybridization ratios. Benchmarking on
standard language modeling and recall tasks reveals that superior standalone
linear models do not necessarily excel in hybrids. While language modeling
remains stable across linear-to-full attention ratios, recall significantly
improves with increased full attention layers, particularly below a 3:1 ratio.
Our study highlights selective gating, hierarchical recurrence, and controlled
forgetting as critical for effective hybrid models. We recommend architectures
such as HGRN-2 or GatedDeltaNet with a linear-to-full ratio between 3:1 and 6:1
to achieve Transformer-level recall efficiently. Our models are open-sourced at
https://huggingface.co/collections/m-a-p/hybrid-linear-attention-research-686c488a63d609d2f20e2b1e.
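To make the two ideas in the abstract concrete, here is a minimal sketch (our own illustration, not the released models or the paper's exact formulation): a gated linear-attention recurrence whose hidden state has fixed size regardless of sequence length, and a simple way to interleave linear and full-attention layers at a linear-to-full ratio such as 3:1. The function names and the elementwise sigmoid forget gate are illustrative assumptions.

```python
# Minimal sketch, assuming an elementwise forget gate and outer-product state
# update; not the authors' implementation.
import numpy as np


def gated_linear_attention_step(S, q, k, v, g):
    """One recurrent step: decay the d_k x d_v state S with gate g,
    write the outer product k v^T, then read it out with the query q."""
    S = g[:, None] * S + np.outer(k, v)   # fixed-size state update
    o = S.T @ q                           # output for this timestep
    return S, o


def hybrid_layer_pattern(n_layers, linear_to_full):
    """Place one full-attention layer after every `linear_to_full`
    linear layers, e.g. 12 layers at 3:1 -> L L L F L L L F L L L F."""
    return ["full" if (i + 1) % (linear_to_full + 1) == 0 else "linear"
            for i in range(n_layers)]


if __name__ == "__main__":
    d_k, d_v = 4, 4
    S = np.zeros((d_k, d_v))
    rng = np.random.default_rng(0)
    for _ in range(8):  # state size stays constant as the sequence grows
        q, k, v = rng.normal(size=(3, d_k))
        g = 1.0 / (1.0 + np.exp(-rng.normal(size=d_k)))  # sigmoid gate in (0, 1)
        S, o = gated_linear_attention_step(S, q, k, v, g)
    print(hybrid_layer_pattern(12, 3))
```

At a 3:1 ratio, 12 layers contain 9 linear and 3 full-attention layers; moving the ratio toward 6:1 keeps language modeling quality roughly stable while trading away some of the recall gains that additional full-attention layers provide.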