The Strong Lottery Ticket Hypothesis for Multi-Head Attention Mechanisms

Abstract

The strong lottery ticket hypothesis (SLTH) conjectures that high-performing subnetworks, called strong lottery tickets (SLTs), are hidden in randomly initialized neural networks. Although recent theoretical studies have established the SLTH for various neural architectures, the SLTH for transformer architectures still lacks a theoretical foundation. In particular, the current theory of the SLTH does not yet account for the multi-head attention (MHA) mechanism, a core component of transformers. To address this gap, we introduce a theoretical analysis of the existence of SLTs within MHAs. We prove that if a randomly initialized MHA with H heads and input dimension d has a hidden dimension of O(d log(Hd^{3/2})) for the key and value, then, with high probability, it contains an SLT that approximates an arbitrary MHA with the same input dimension. Furthermore, by leveraging this theory for MHAs, we extend the SLTH to transformers without normalization layers. We empirically validate our theoretical findings, showing that the approximation error between the SLT within a source model (MHA and transformer) and an approximate target counterpart decreases exponentially as the hidden dimension of the source model increases.
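
To get a feel for how the stated bound scales, the sketch below evaluates the term d log(Hd^{3/2}) for a few values of H and d. It assumes a natural logarithm and ignores the constant hidden by the O(.) notation, which the abstract does not specify. The function name `kv_hidden_dim_scale` and the sample values are hypothetical, so the printed numbers indicate relative growth only, not the hidden dimension actually required.

```python
import math


def kv_hidden_dim_scale(num_heads: int, input_dim: int) -> float:
    """Evaluate d * log(H * d**1.5), the growth term inside the O(.) bound.

    Assumptions (not stated in the abstract): natural logarithm, and the
    hidden constant factor is dropped entirely.
    """
    if num_heads < 1 or input_dim < 1:
        raise ValueError("num_heads and input_dim must be positive integers")
    return input_dim * math.log(num_heads * input_dim ** 1.5)


if __name__ == "__main__":
    # Illustrative (H, d) pairs chosen for this sketch only.
    for num_heads, input_dim in [(1, 64), (8, 64), (8, 512)]:
        scale = kv_hidden_dim_scale(num_heads, input_dim)
        print(f"H={num_heads:>2}, d={input_dim:>3}: d*log(H*d^1.5) = {scale:.1f}")
```

The term grows roughly linearly in d (up to a logarithmic factor) and only logarithmically in H, so under this bound adding heads is much cheaper than widening the input dimension.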

