MoH: Multi-Head Attention as Mixture-of-Head Attention

Abstract

In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in summation form. Drawing on the insight that not all attention heads are equally important, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages. First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, which makes the attention mechanism more flexible and unlocks additional performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention while using only 50%-90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% while using only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a strong foundation for developing advanced and efficient attention-based models.
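The two ideas in the abstract, the summation form of multi-head attention and the weighted sum over selected heads, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the softmax top-k router, the router matrix `Wr`, the function names, and all shapes are assumptions made for illustration, and the abstract does not specify the routing details.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def head_outputs(X, Wq, Wk, Wv):
    """Per-head attention outputs, shape (h, T, d_head)."""
    d_head = Wq.shape[-1]
    Q = np.einsum("td,hde->hte", X, Wq)
    K = np.einsum("td,hde->hte", X, Wk)
    V = np.einsum("td,hde->hte", X, Wv)
    scores = np.einsum("hte,hse->hts", Q, K) / np.sqrt(d_head)
    return np.einsum("hts,hse->hte", softmax(scores), V)

def mha_concat(H, Wo):
    """Concatenation form: concatenate heads, then apply one output projection."""
    h, T, d_head = H.shape
    concat = H.transpose(1, 0, 2).reshape(T, h * d_head)
    return concat @ Wo.reshape(h * d_head, -1)

def mha_sum(H, Wo):
    """Summation form: each head uses its own slice of the output projection."""
    return np.einsum("hte,hed->td", H, Wo)

def moh_weighted_sum(X, H, Wo, Wr, k):
    """Assumed routing: each token keeps its top-k heads, weighted by router scores."""
    probs = softmax(X @ Wr)                        # (T, h) router scores per token
    topk = np.argsort(-probs, axis=-1)[:, :k]      # indices of the k best heads
    gates = np.zeros_like(probs)
    np.put_along_axis(gates, topk, np.take_along_axis(probs, topk, axis=-1), axis=-1)
    per_head = np.einsum("hte,hed->htd", H, Wo)    # each head's projected output
    return np.einsum("th,htd->td", gates, per_head), gates

rng = np.random.default_rng(0)
T, d_model, h, d_head = 5, 16, 4, 4                # toy sizes, chosen arbitrarily
X = rng.normal(size=(T, d_model))
Wq, Wk, Wv = (rng.normal(size=(h, d_model, d_head)) for _ in range(3))
Wo = rng.normal(size=(h, d_head, d_model))
Wr = rng.normal(size=(d_model, h))                 # hypothetical router weights

H = head_outputs(X, Wq, Wk, Wv)
same = np.allclose(mha_concat(H, Wo), mha_sum(H, Wo))  # the two forms agree
out, gates = moh_weighted_sum(X, H, Wo, Wr, k=3)       # 3 of 4 heads, i.e. 75%
```

For clarity the sketch computes every head and then zeroes the gates of unselected heads. An efficient implementation would skip the computation for inactive heads, which is where the inference savings described in the abstract would come from.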
