MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression

June 21, 2024
Authors: Tianyu Fu, Haofeng Huang, Xuefei Ning, Genghan Zhang, Boju Chen, Tianqi Wu, Hongyi Wang, Zixiao Huang, Shiyao Li, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang
cs.AI

Abstract

Sparse attention can effectively mitigate the significant memory and throughput demands of Large Language Models (LLMs) in long contexts. Existing methods typically employ a uniform sparse attention mask, applying the same sparse pattern across different attention heads and input lengths. However, this uniform approach fails to capture the diverse attention patterns inherent in LLMs, ignoring their distinct accuracy-latency trade-offs. To address this challenge, we propose the Mixture of Attention (MoA), which automatically tailors distinct sparse attention configurations to different heads and layers. MoA constructs and navigates a search space of various attention patterns and their scaling rules relative to input sequence lengths. It profiles the model, evaluates potential configurations, and pinpoints the optimal sparse attention compression plan. MoA adapts to varying input sizes, revealing that some attention heads expand their focus to accommodate longer sequences, while other heads consistently concentrate on fixed-length local contexts. Experiments show that MoA increases the effective context length by 3.9× with the same average attention span, boosting retrieval accuracy by 1.5-7.1× over the uniform-attention baseline across Vicuna-7B, Vicuna-13B, and Llama3-8B models. Moreover, MoA narrows the capability gaps between sparse and dense models, reducing the maximum relative performance drop from 9%-36% to within 5% across two long-context understanding benchmarks. MoA achieves a 1.2-1.4× GPU memory reduction and boosts decode throughput by 5.5-6.7× for 7B and 13B dense models on a single GPU, with minimal impact on performance.
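
To make the core idea concrete, the sketch below is a minimal, illustrative rendering (not the authors' implementation) of heterogeneous per-head sparse attention whose span scales with input length. It assumes a causal sliding-window mask per head and a linear scaling rule span = alpha + beta * n; the helper names and the per-head (alpha, beta) values are hypothetical choices for demonstration only.

```python
# Illustrative sketch of per-head sliding-window attention with heads whose
# window span follows an assumed linear rule in the input length n.
import numpy as np

def head_span(alpha: float, beta: float, n: int) -> int:
    """Attention span for one head at input length n (assumed linear rule)."""
    return max(1, min(n, int(alpha + beta * n)))

def sliding_window_mask(n: int, span: int) -> np.ndarray:
    """Causal mask: query i attends to keys j with i - span < j <= i."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - span)

def sparse_attention(q, k, v, mask):
    """Masked scaled-dot-product attention for a single head."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = np.where(mask, weights, 0.0)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Hypothetical per-head rules: beta = 0 keeps a fixed "local" span, while
# larger beta lets a head widen its focus as the input grows.
rng = np.random.default_rng(0)
n, d = 64, 16
rules = [(8, 0.0), (8, 0.25), (16, 0.5), (4, 1.0)]
q, k, v = (rng.standard_normal((len(rules), n, d)) for _ in range(3))
outputs = []
for h, (alpha, beta) in enumerate(rules):
    mask = sliding_window_mask(n, head_span(alpha, beta, n))
    outputs.append(sparse_attention(q[h], k[h], v[h], mask))
out = np.stack(outputs)  # (heads, n, d)
print(out.shape, [head_span(a, b, n) for a, b in rules])
```

In this toy setup, a head with beta = 0 mirrors the "fixed-length local context" behavior described in the abstract, while heads with beta > 0 expand their attention span as sequences grow; MoA's contribution is to search for such per-head configurations automatically from profiling, rather than hand-picking them as done here.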
