

The Strong Lottery Ticket Hypothesis for Multi-Head Attention Mechanisms

November 6, 2025
Authors: Hikari Otsuka, Daiki Chijiwa, Yasuyuki Okoshi, Daichi Fujiki, Susumu Takeuchi, Masato Motomura
cs.AI

Abstract

The strong lottery ticket hypothesis (SLTH) conjectures that high-performing subnetworks, called strong lottery tickets (SLTs), are hidden in randomly initialized neural networks. Although recent theoretical studies have established the SLTH across various neural architectures, the SLTH for transformer architectures still lacks theoretical understanding. In particular, the current theory of the SLTH does not yet account for the multi-head attention (MHA) mechanism, a core component of transformers. To address this gap, we introduce a theoretical analysis of the existence of SLTs within MHAs. We prove that, if a randomly initialized MHA with H heads and input dimension d has hidden dimension O(d log(H d^{3/2})) for the keys and values, then with high probability it contains an SLT that approximates an arbitrary MHA of the same input dimension. Furthermore, by leveraging this theory for MHAs, we extend the SLTH to transformers without normalization layers. We empirically validate our theoretical findings, demonstrating that the approximation error between the SLT within a source model (MHA and transformer) and the target model it approximates decreases exponentially as the hidden dimension of the source model increases.
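To make the setting concrete, the sketch below (not from the paper; all names, shapes, and the random masks are illustrative assumptions) shows what an SLT inside an MHA means: a randomly initialized multi-head attention whose weights are only pruned by binary masks, never trained. The SLTH asserts that a good mask exists when the key/value hidden dimension is large enough, roughly O(d log(H d^{3/2})); it does not prescribe the search procedure, so a random mask stands in for whatever a pruning algorithm would find.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_mha(X, params, masks):
    """Multi-head attention whose random weights are pruned by binary masks.

    X      : (n, d) input sequence.
    params : per-head random matrices W_Q, W_K, W_V of shape (d, d_hid),
             plus an output projection W_O of shape (H * d_hid, d).
    masks  : binary arrays with the same shapes as the weights; the surviving
             subnetwork (mask == 1) is the candidate strong lottery ticket.
    """
    heads = []
    for h in range(len(params["W_Q"])):
        W_Q = params["W_Q"][h] * masks["W_Q"][h]   # prune, never train
        W_K = params["W_K"][h] * masks["W_K"][h]
        W_V = params["W_V"][h] * masks["W_V"][h]
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        A = softmax(Q @ K.T / np.sqrt(W_K.shape[1]))
        heads.append(A @ V)
    W_O = params["W_O"] * masks["W_O"]
    return np.concatenate(heads, axis=-1) @ W_O

# Toy setup: H heads, input dimension d, key/value hidden dimension d_hid.
rng = np.random.default_rng(0)
H, d, d_hid, n = 4, 16, 64, 8          # d_hid plays the role of O(d log(H d^{3/2}))
params = {
    "W_Q": [rng.standard_normal((d, d_hid)) for _ in range(H)],
    "W_K": [rng.standard_normal((d, d_hid)) for _ in range(H)],
    "W_V": [rng.standard_normal((d, d_hid)) for _ in range(H)],
    "W_O": rng.standard_normal((H * d_hid, d)),
}
# Placeholder masks; a real search (e.g. an edge-popup-style method) would
# optimize these to approximate a given target MHA.
masks = {k: ([(rng.random(w.shape) < 0.5).astype(float) for w in v]
             if isinstance(v, list) else (rng.random(v.shape) < 0.5).astype(float))
         for k, v in params.items()}
X = rng.standard_normal((n, d))
print(masked_mha(X, params, masks).shape)   # (n, d)
```

Under this framing, the paper's result says that for a sufficiently wide random source MHA, some choice of masks makes the pruned network's output close to that of any target MHA with the same input dimension, and the experiments report that this approximation error shrinks exponentially as d_hid grows.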