MoH: Multi-Head Attention as Mixture-of-Head Attention

Abstract

In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while matching or surpassing its previous accuracy. We show that multi-head attention can be expressed in summation form. Drawing on the insight that not all attention heads are equally important, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as the experts in a Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages. First, MoH lets each token select the appropriate attention heads, which improves inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, which adds flexibility to the attention mechanism and unlocks extra performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention while using only 50%-90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be continue-tuned into MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% while using only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a strong foundation for developing advanced and efficient attention-based models.
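The two structural claims in the abstract can be illustrated with a small NumPy sketch. The first part checks a standard identity: concatenating the heads and applying one output projection gives the same result as summing each head multiplied by its own slice of that projection. The second part shows one possible weighted summation with per-token head selection. The abstract does not specify how heads are routed, so the top-k softmax router, the `w_router` matrix, and all sizes below are assumptions borrowed from common MoE practice, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def head_outputs(x, w_q, w_k, w_v):
    """Per-head self-attention outputs, shape (heads, tokens, d_head)."""
    q = np.einsum("td,hde->hte", x, w_q)
    k = np.einsum("td,hde->hte", x, w_k)
    v = np.einsum("td,hde->hte", x, w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def mha_concat(heads, w_o):
    """Concatenation form: join the heads, then apply one output projection."""
    h, t, e = heads.shape
    concat = heads.transpose(1, 0, 2).reshape(t, h * e)
    return concat @ w_o.reshape(h * e, -1)

def mha_sum(heads, w_o):
    """Summation form: each head is projected by its own slice of w_o."""
    return np.einsum("hte,hed->td", heads, w_o)

def moh_weighted_sum(x, heads, w_o, w_router, k):
    """Assumed top-k router: each token keeps k heads and weights them."""
    logits = x @ w_router                        # (tokens, heads)
    top = np.argsort(logits, axis=-1)[:, -k:]    # indices of the k largest logits
    mask = np.zeros_like(logits, dtype=bool)
    np.put_along_axis(mask, top, True, axis=-1)
    # Softmax over the selected heads only; unselected heads get weight 0.
    gates = softmax(np.where(mask, logits, -np.inf))
    per_head = np.einsum("hte,hed->htd", heads, w_o)
    return np.einsum("th,htd->td", gates, per_head), gates

# Illustrative sizes only; none of these come from the paper.
rng = np.random.default_rng(0)
tokens, d_model, n_heads, d_head = 5, 16, 4, 4
x = rng.normal(size=(tokens, d_model))
w_q, w_k, w_v = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
w_o = rng.normal(size=(n_heads, d_head, d_model))
w_router = rng.normal(size=(d_model, n_heads))

heads = head_outputs(x, w_q, w_k, w_v)
assert np.allclose(mha_concat(heads, w_o), mha_sum(heads, w_o))

# Keep 3 of 4 heads per token, i.e. 75% of the heads.
out, gates = moh_weighted_sum(x, heads, w_o, w_router, k=3)
```

For clarity the sketch computes every head and then zeroes the unselected ones, so it shows the arithmetic but not the efficiency gain; an implementation aiming at faster inference would skip the unselected heads for each token.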
