The Strong Lottery Ticket Hypothesis for Multi-Head Attention Mechanisms
November 6, 2025
Authors: Hikari Otsuka, Daiki Chijiwa, Yasuyuki Okoshi, Daichi Fujiki, Susumu Takeuchi, Masato Motomura
cs.AI
Abstract
The strong lottery ticket hypothesis (SLTH) conjectures that high-performing
subnetworks, called strong lottery tickets (SLTs), are hidden in randomly
initialized neural networks. Although recent theoretical studies have
established the SLTH across various neural architectures, the SLTH for
transformer architectures still lacks theoretical understanding. In particular,
the current theory of the SLTH does not yet account for the multi-head
attention (MHA) mechanism, a core component of transformers. To address this
gap, we introduce a theoretical analysis of the existence of SLTs within MHAs.
We prove that, if a randomly initialized MHA with H heads and input dimension
d has a key/value hidden dimension of O(d log(Hd^{3/2})), then with high
probability it contains an SLT that approximates an arbitrary MHA of the same
input dimension. Furthermore, by leveraging this theory for MHAs, we extend
the SLTH to transformers without normalization layers. We empirically validate
our theoretical findings, demonstrating that the approximation error between
the SLT within a source model (an MHA or a transformer) and its target
counterpart decreases exponentially as the hidden dimension of the source
model increases.
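
To make the setting concrete, below is a minimal sketch (not from the paper) of a randomly initialized multi-head attention module whose weights stay frozen while binary masks select a subnetwork, i.e. an SLT candidate. The class name MaskedMHA, the chosen dimensions, and the random masks are illustrative assumptions; an actual SLT search would learn the masks (for example with an edge-popup-style score) rather than sample them at random.

```python
# Minimal sketch of the strong-lottery-ticket setting for multi-head attention:
# the source model keeps its random initialization frozen, and binary masks
# define the pruned subnetwork. All names and dimensions here are illustrative.
import torch
import torch.nn.functional as F


class MaskedMHA(torch.nn.Module):
    def __init__(self, d: int, d_kv: int, num_heads: int):
        super().__init__()
        assert d_kv % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_kv // num_heads
        # Randomly initialized, frozen weights (the "source" MHA).
        self.w_q = torch.nn.Parameter(torch.randn(d, d_kv), requires_grad=False)
        self.w_k = torch.nn.Parameter(torch.randn(d, d_kv), requires_grad=False)
        self.w_v = torch.nn.Parameter(torch.randn(d, d_kv), requires_grad=False)
        self.w_o = torch.nn.Parameter(torch.randn(d_kv, d), requires_grad=False)
        # Binary masks defining the subnetwork. Here they are random
        # placeholders; an SLT search would optimize them instead.
        for name in ("w_q", "w_k", "w_v", "w_o"):
            mask = (torch.rand_like(getattr(self, name)) > 0.5).float()
            self.register_buffer(f"mask_{name}", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d)
        b, n, _ = x.shape
        q = x @ (self.w_q * self.mask_w_q)
        k = x @ (self.w_k * self.mask_w_k)
        v = x @ (self.w_v * self.mask_w_v)

        def split(t: torch.Tensor) -> torch.Tensor:
            # (batch, seq_len, d_kv) -> (batch, heads, seq_len, head_dim)
            return t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return out @ (self.w_o * self.mask_w_o)


if __name__ == "__main__":
    mha = MaskedMHA(d=64, d_kv=128, num_heads=4)
    x = torch.randn(2, 10, 64)
    print(mha(x).shape)  # torch.Size([2, 10, 64])
```

In this framing, the paper's result says that if d_kv is on the order of d log(Hd^{3/2}), then with high probability some choice of masks makes this frozen, pruned module approximate any target MHA with the same input dimension.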