Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

February 16, 2025
Authors: Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng
cs.AI

Abstract

Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while maintaining model capabilities. We present NSA, a Natively trainable Sparse Attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations to achieve efficient long-context modeling. NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision. Our approach advances sparse attention design with two key innovations: (1) We achieve substantial speedups through arithmetic intensity-balanced algorithm design, with implementation optimizations for modern hardware. (2) We enable end-to-end training, reducing pretraining computation without sacrificing model performance. As shown in Figure 1, experiments show the model pretrained with NSA maintains or exceeds Full Attention models across general benchmarks, long-context tasks, and instruction-based reasoning. Meanwhile, NSA achieves substantial speedups over Full Attention on 64k-length sequences across decoding, forward propagation, and backward propagation, validating its efficiency throughout the model lifecycle.
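To make the hierarchical strategy described above concrete, the following is a minimal, single-query sketch of the general idea: a coarse branch that attends over block-compressed keys/values, a fine branch that attends over the tokens of the top-scoring blocks, and a local sliding window. The block size, top-k, window length, mean-pool compressor, and the uniform averaging of branches are illustrative assumptions for this sketch, not the paper's exact parameterization or its hardware-aligned kernel.

```python
# Conceptual sketch of hierarchical sparse attention (coarse compression +
# fine-grained block selection + sliding window) for a single decode query.
# Parameter choices and the mean-pool compressor are illustrative only.
import torch
import torch.nn.functional as F

def nsa_like_attention(q, K, V, block_size=64, top_k=4, window=128):
    """q: (d,), K/V: (T, d). Returns a (d,) output combining three branches."""
    T, d = K.shape
    scale = d ** -0.5

    # 1) Coarse branch: compress each block of keys/values by mean pooling,
    #    then attend over the compressed sequence for global context.
    n_blocks = (T + block_size - 1) // block_size
    pad = n_blocks * block_size - T
    K_pad = F.pad(K, (0, 0, 0, pad))
    V_pad = F.pad(V, (0, 0, 0, pad))
    K_cmp = K_pad.view(n_blocks, block_size, d).mean(dim=1)   # (n_blocks, d)
    V_cmp = V_pad.view(n_blocks, block_size, d).mean(dim=1)
    cmp_scores = (K_cmp @ q) * scale                          # (n_blocks,)
    out_cmp = F.softmax(cmp_scores, dim=0) @ V_cmp

    # 2) Fine branch: pick the top-k blocks by coarse score and attend over
    #    their original (uncompressed) tokens for local precision.
    k_sel = min(top_k, n_blocks)
    sel = cmp_scores.topk(k_sel).indices
    tok_idx = (sel[:, None] * block_size + torch.arange(block_size)).flatten()
    tok_idx = tok_idx[tok_idx < T]                             # drop padding
    K_sel, V_sel = K[tok_idx], V[tok_idx]
    out_sel = F.softmax((K_sel @ q) * scale, dim=0) @ V_sel

    # 3) Local branch: a sliding window over the most recent tokens.
    K_win, V_win = K[-window:], V[-window:]
    out_win = F.softmax((K_win @ q) * scale, dim=0) @ V_win

    # The paper combines branches with learned gates; a uniform average
    # stands in for that gating here.
    return (out_cmp + out_sel + out_win) / 3.0

# Example: one decode step over a 4k-token context.
q = torch.randn(128)
K = torch.randn(4096, 128)
V = torch.randn(4096, 128)
print(nsa_like_attention(q, K, V).shape)  # torch.Size([128])
```

Only the selected blocks and the local window touch uncompressed keys/values, which is what keeps per-query memory traffic far below full attention at long context lengths.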
