
MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head Attention

January 12, 2026
Authors: Kewei Zhang, Ye Huang, Yufan Deng, Jincheng Yu, Junsong Chen, Huan Ling, Enze Xie, Daquan Zhou
cs.AI

Abstract

While the Transformer architecture dominates many fields, its quadratic self-attention complexity hinders its use in large-scale applications. Linear attention offers an efficient alternative, but its direct application often degrades performance, and existing fixes typically re-introduce computational overhead through extra modules (e.g., depthwise separable convolution) that defeat the original purpose. In this work, we identify a key failure mode in these methods: global context collapse, where the model loses representational diversity. To address this, we propose Multi-Head Linear Attention (MHLA), which preserves this diversity by dividing tokens into heads along the token dimension and computing attention within each head. We prove that MHLA maintains linear complexity while recovering much of the expressive power of softmax attention, and verify its effectiveness across multiple domains, achieving a 3.6% improvement on ImageNet classification, a 6.3% gain on NLP, a 12.6% improvement on image generation, and a 41% enhancement on video generation under the same time complexity.
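To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of linear attention computed within token groups: the sequence is split into `num_token_heads` contiguous groups, and kernelized linear attention is computed independently inside each group, so each group keeps its own global key-value summary rather than sharing a single collapsed context. The non-causal setting, the ELU+1 feature map, and the even contiguous split are illustrative assumptions.

```python
# Sketch of token-level multi-head linear attention, per the abstract's description.
# Assumptions (not from the paper): non-causal attention, ELU+1 feature map,
# sequence length divisible by the number of token heads.
import torch
import torch.nn.functional as F


def token_level_multihead_linear_attention(q, k, v, num_token_heads=4, eps=1e-6):
    """q, k, v: (batch, seq_len, dim). Returns (batch, seq_len, dim)."""
    B, N, D = q.shape
    H = num_token_heads
    phi = lambda x: F.elu(x) + 1.0  # non-negative feature map used in linear attention

    # Split the token dimension into H contiguous groups: (B, H, N // H, D).
    qh = phi(q).reshape(B, H, N // H, D)
    kh = phi(k).reshape(B, H, N // H, D)
    vh = v.reshape(B, H, N // H, D)

    # Linear attention within each group: out = phi(q) (phi(k)^T v) / (phi(q) sum phi(k)).
    # Total cost is O(N * D^2), i.e. linear in sequence length.
    kv = torch.einsum("bhnd,bhne->bhde", kh, vh)              # per-group KV summary
    z = kh.sum(dim=2)                                          # per-group normalizer
    num = torch.einsum("bhnd,bhde->bhne", qh, kv)
    den = torch.einsum("bhnd,bhd->bhn", qh, z).unsqueeze(-1) + eps
    return (num / den).reshape(B, N, D)


# Example usage with random tensors.
x = torch.randn(2, 64, 32)
y = token_level_multihead_linear_attention(x, x, x, num_token_heads=4)
print(y.shape)  # torch.Size([2, 64, 32])
```

Because each token group maintains its own KV summary, the representation is not forced through one shared global context, which is the collapse mode the abstract attributes to plain linear attention.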