MoH: Multi-Head Attention as Mixture-of-Head Attention
October 15, 2024
Authors: Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan
cs.AI
Abstract
In this work, we upgrade the multi-head attention mechanism, the core of the
Transformer model, to improve efficiency while maintaining or surpassing the
previous accuracy level. We show that multi-head attention can be expressed in
the summation form. Drawing on the insight that not all attention heads hold
equal significance, we propose Mixture-of-Head attention (MoH), a new
architecture that treats attention heads as experts in the Mixture-of-Experts
(MoE) mechanism. MoH has two significant advantages: First, MoH enables each
token to select the appropriate attention heads, enhancing inference efficiency
without compromising accuracy or increasing the number of parameters. Second,
MoH replaces the standard summation in multi-head attention with a weighted
summation, introducing flexibility to the attention mechanism and unlocking
extra performance potential. Extensive experiments on ViT, DiT, and LLMs
demonstrate that MoH outperforms multi-head attention while using only 50%-90% of
the attention heads. Moreover, we demonstrate that pre-trained multi-head
attention models, such as LLaMA3-8B, can be further tuned into our MoH models
through continued training. Notably, MoH-LLaMA3-8B achieves an average accuracy
of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% while utilizing only 75% of the
attention heads. We believe the proposed MoH is a promising alternative to
multi-head attention and provides a strong foundation for developing advanced
and efficient attention-based models.
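
For intuition on the summation form mentioned above: standard multi-head attention, Concat(head_1, ..., head_H) W^O, can be rewritten as the sum over heads of head_h W^O_h, where W^O_h is the h-th block of rows of the output projection; MoH replaces this plain sum with a weighted sum over a per-token subset of heads chosen by a router. The following minimal PyTorch-style sketch illustrates that idea only; the module and variable names (MoHAttention, router, top_k) are illustrative assumptions rather than the authors' released code, and details such as any shared heads or auxiliary load-balancing losses are omitted.

import torch
import torch.nn as nn

class MoHAttention(nn.Module):
    """Sketch: each token routes to its Top-K attention heads, and head outputs
    are combined by a weighted summation instead of a plain concatenation-sum."""
    def __init__(self, dim, num_heads=8, top_k=6):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.top_k = top_k
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out_proj = nn.Linear(dim, dim)
        self.router = nn.Linear(dim, num_heads)  # per-token score for each head

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (B, H, T, Dh) for per-head attention.
        q = q.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)

        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)
        heads = attn @ v                      # (B, H, T, Dh): per-head outputs

        # Router: keep the Top-K heads per token, softmax-normalize their scores,
        # and set the gates of unselected heads to zero.
        scores = self.router(x)               # (B, T, H)
        topk_val, topk_idx = scores.topk(self.top_k, dim=-1)
        gates = torch.zeros_like(scores).scatter(-1, topk_idx, topk_val.softmax(dim=-1))

        # Weighted summation: scaling each head block before the shared output
        # projection is equivalent to sum_h gate_h * head_h W^O_h.
        heads = heads.transpose(1, 2)          # (B, T, H, Dh)
        mixed = heads * gates.unsqueeze(-1)    # unselected heads contribute zero
        out = mixed.reshape(B, T, C)
        return self.out_proj(out)

With top_k=6 and num_heads=8, each token activates 75% of the heads, matching the head-activation ratio quoted for MoH-LLaMA3-8B in the abstract; lowering top_k trades accuracy for inference efficiency.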