MoH: Multi-Head Attention as Mixture-of-Head Attention
October 15, 2024
Authors: Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan
cs.AI
Abstract
In this work, we upgrade the multi-head attention mechanism, the core of the
Transformer model, to improve efficiency while maintaining or surpassing the
previous accuracy level. We show that multi-head attention can be expressed in
the summation form. Drawing on the insight that not all attention heads hold
equal significance, we propose Mixture-of-Head attention (MoH), a new
architecture that treats attention heads as experts in the Mixture-of-Experts
(MoE) mechanism. MoH has two significant advantages: First, MoH enables each
token to select the appropriate attention heads, enhancing inference efficiency
without compromising accuracy or increasing the number of parameters. Second,
MoH replaces the standard summation in multi-head attention with a weighted
summation, introducing flexibility to the attention mechanism and unlocking
extra performance potential. Extensive experiments on ViT, DiT, and LLMs
demonstrate that MoH outperforms multi-head attention while using only 50%-90% of
the attention heads. Moreover, we demonstrate that pre-trained multi-head
attention models, such as LLaMA3-8B, can be further tuned into our MoH models
through continued training. Notably, MoH-LLaMA3-8B achieves an average accuracy
of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% while utilizing only 75% of the
attention heads. We believe the proposed MoH is a promising alternative to
multi-head attention and provides a strong foundation for developing advanced
and efficient attention-based models.
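
For intuition on the summation form mentioned above: standard multi-head attention, Concat(head_1, ..., head_H) W^O, can be rewritten as the sum over heads of head_h W^O_h, where W^O_h is the h-th block of rows of the output projection; MoH replaces this plain sum with a weighted sum over a per-token subset of heads chosen by a router. The following minimal PyTorch-style sketch illustrates that idea only; the module and variable names (MoHAttention, router, top_k) are illustrative assumptions rather than the authors' released code, and details such as any shared heads or auxiliary load-balancing losses are omitted.

import torch
import torch.nn as nn

class MoHAttention(nn.Module):
    """Sketch: each token routes to its Top-K attention heads, and head outputs
    are combined by a weighted summation instead of a plain concatenation-sum."""
    def __init__(self, dim, num_heads=8, top_k=6):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.top_k = top_k
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out_proj = nn.Linear(dim, dim)
        self.router = nn.Linear(dim, num_heads)  # per-token score for each head

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (B, H, T, Dh) for per-head attention.
        q = q.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)

        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)
        heads = attn @ v                      # (B, H, T, Dh): per-head outputs

        # Router: keep the Top-K heads per token, softmax-normalize their scores,
        # and set the gates of unselected heads to zero.
        scores = self.router(x)               # (B, T, H)
        topk_val, topk_idx = scores.topk(self.top_k, dim=-1)
        gates = torch.zeros_like(scores).scatter(-1, topk_idx, topk_val.softmax(dim=-1))

        # Weighted summation: scaling each head block before the shared output
        # projection is equivalent to sum_h gate_h * head_h W^O_h.
        heads = heads.transpose(1, 2)          # (B, T, H, Dh)
        mixed = heads * gates.unsqueeze(-1)    # unselected heads contribute zero
        out = mixed.reshape(B, T, C)
        return self.out_proj(out)

With top_k=6 and num_heads=8, each token activates 75% of the heads, matching the head-activation ratio quoted for MoH-LLaMA3-8B in the abstract; lowering top_k trades accuracy for inference efficiency.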