

A Systematic Analysis of Hybrid Linear Attention

July 8, 2025
Authors: Dustin Wang, Rui-Jie Zhu, Steven Abreu, Yong Shan, Taylor Kergan, Yuqi Pan, Yuhong Chou, Zheng Li, Ge Zhang, Wenhao Huang, Jason Eshraghian
cs.AI

Abstract

Transformers face quadratic complexity and memory issues with long sequences, prompting the adoption of linear attention mechanisms that use fixed-size hidden states. However, linear models often suffer from limited recall performance, leading to hybrid architectures that combine linear and full attention layers. Despite extensive hybrid architecture research, the choice of linear attention component has not been deeply explored. We systematically evaluate linear attention models across generations, from vector recurrences to advanced gating mechanisms, both standalone and hybridized. To enable this comprehensive analysis, we trained and open-sourced 72 models: 36 at 340M parameters (20B tokens) and 36 at 1.3B parameters (100B tokens), covering six linear attention variants across five hybridization ratios. Benchmarking on standard language modeling and recall tasks reveals that superior standalone linear models do not necessarily excel in hybrids. While language modeling remains stable across linear-to-full attention ratios, recall improves significantly with more full attention layers, particularly below a 3:1 ratio. Our study highlights selective gating, hierarchical recurrence, and controlled forgetting as critical for effective hybrid models. We recommend architectures such as HGRN-2 or GatedDeltaNet with a linear-to-full ratio between 3:1 and 6:1 to achieve Transformer-level recall efficiently. Our models are open-sourced at https://huggingface.co/collections/m-a-p/hybrid-linear-attention-research-686c488a63d609d2f20e2b1e.
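
To make the layer-ratio idea concrete, below is a minimal, self-contained PyTorch sketch (not the authors' released code) of a hybrid stack that interleaves linear-attention and full-attention mixers at a 3:1 ratio. The LinearAttention and FullAttention classes, the softmax feature map, and the build_hybrid_stack helper are illustrative assumptions rather than the paper's architectures.

```python
# Sketch: a hybrid layer stack at a 3:1 linear-to-full ratio (assumed example).
import torch
import torch.nn as nn

class LinearAttention(nn.Module):
    """Causal linear attention with a fixed-size (d x d) recurrent state."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)

    def forward(self, x):                          # x: (batch, seq, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = q.softmax(-1), k.softmax(-1)        # simple positive feature map
        b, t, d = q.shape
        state = torch.zeros(b, d, d, device=x.device)
        outs = []
        for i in range(t):                         # recurrent form: state size is constant in t
            state = state + k[:, i].unsqueeze(-1) * v[:, i].unsqueeze(1)
            outs.append(q[:, i].unsqueeze(1) @ state)
        return torch.cat(outs, dim=1)

class FullAttention(nn.Module):
    """Standard causal softmax attention (quadratic in sequence length)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        t = x.size(1)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), 1)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out

def build_hybrid_stack(dim, num_layers, ratio=3):
    """Every (ratio+1)-th layer is full attention; the rest are linear."""
    return nn.ModuleList(
        FullAttention(dim) if (i + 1) % (ratio + 1) == 0 else LinearAttention(dim)
        for i in range(num_layers)
    )

layers = build_hybrid_stack(dim=64, num_layers=8, ratio=3)   # L L L F L L L F
x = torch.randn(2, 16, 64)
for layer in layers:
    x = x + layer(x)       # residual connection around each token mixer
```

The linear layers keep a fixed-size state regardless of sequence length, while the sparse full-attention layers supply the precise token-level recall that the paper's benchmarks measure; the 3:1 interleaving shown here corresponds to the lower end of the recommended 3:1 to 6:1 range.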