Mixture-of-Depths Attention
March 16, 2026
Authors: Lianghui Zhu, Yuxin Fang, Bencheng Liao, Shijie Wang, Tianheng Cheng, Zilong Huang, Chen Chen, Lai Wei, Yutao Zeng, Ya Wang, Yi Lin, Yu Li, Xinggang Wang
cs.AI
Abstract
Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns, achieving 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models demonstrate that MoDA consistently outperforms strong baselines. Notably, it reduces average perplexity by 0.2 across 10 validation benchmarks and increases average performance by 2.11% on 10 downstream tasks, with a negligible 3.7% FLOPs overhead. We also find that combining MoDA with post-norm yields better performance than using it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling. Code is released at https://github.com/hustvl/MoDA.
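The core idea, each head attending jointly to current-layer sequence KV pairs and depth KV pairs cached from preceding layers, can be sketched as a single-head, unmasked toy version. This is an illustrative assumption-based sketch, not the paper's hardware-efficient implementation: the function name `moda_attention` and the plain concatenation of current-layer and depth KV are expository choices, and causal masking and multi-head handling are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moda_attention(q, k_cur, v_cur, depth_k, depth_v):
    """Single-head attention over the union of current-layer sequence KV
    pairs and depth KV pairs cached from earlier layers (illustrative).

    q:        (T, d) queries at the current layer
    k_cur:    (T, d) current-layer keys
    v_cur:    (T, d) current-layer values
    depth_k:  list of (T, d) key arrays from preceding layers
    depth_v:  list of (T, d) value arrays from preceding layers
    """
    d = q.shape[-1]
    # Extend the key/value set with cached depth KV from earlier layers.
    k = np.concatenate([k_cur] + list(depth_k), axis=0)  # ((1+L)*T, d)
    v = np.concatenate([v_cur] + list(depth_v), axis=0)
    scores = q @ k.T / np.sqrt(d)                        # (T, (1+L)*T)
    return softmax(scores, axis=-1) @ v                  # (T, d)
```

With an empty depth KV list the sketch reduces to standard scaled dot-product attention over the current layer, which makes the extra attention span contributed by the depth KV pairs easy to isolate when experimenting.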