

The Strong Lottery Ticket Hypothesis for Multi-Head Attention Mechanisms

November 6, 2025
作者: Hikari Otsuka, Daiki Chijiwa, Yasuyuki Okoshi, Daichi Fujiki, Susumu Takeuchi, Masato Motomura
cs.AI

Abstract

The strong lottery ticket hypothesis (SLTH) conjectures that high-performing subnetworks, called strong lottery tickets (SLTs), are hidden in randomly initialized neural networks. Although recent theoretical studies have established the SLTH across various neural architectures, the SLTH for transformer architectures still lacks theoretical understanding. In particular, the current theory of the SLTH does not yet account for the multi-head attention (MHA) mechanism, a core component of transformers. To address this gap, we present the first theoretical analysis of the existence of SLTs within MHAs. We prove that if a randomly initialized MHA with H heads and input dimension d has a key/value hidden dimension of O(d log(Hd^{3/2})), then with high probability it contains an SLT that approximates an arbitrary MHA with the same input dimension. Furthermore, by leveraging this theory for MHAs, we extend the SLTH to transformers without normalization layers. We empirically validate our theoretical findings, demonstrating that the approximation error between the SLT within a source model (an MHA or transformer) and its target counterpart decreases exponentially as the hidden dimension of the source model increases.
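
To make the setting concrete, the sketch below (PyTorch, illustrative only, not the authors' construction) shows what a strong lottery ticket is operationally: a binary mask applied to the frozen weights of a randomly initialized MHA, producing a subnetwork whose output is compared against a target MHA. The mask here is drawn at random purely to show the mechanics; the paper's result concerns the existence of a good mask when the source's key/value hidden dimension scales as O(d log(Hd^{3/2})), and finding such a mask requires a pruning procedure not shown here. Dimensions and seeds are arbitrary choices for the example.

```python
# Minimal sketch of the SLT setup: mask a randomly initialized multi-head
# attention layer and measure how far the masked subnetwork is from a target.
# The random mask below is a placeholder for an actual pruning procedure.
import torch
import torch.nn as nn

d, H = 64, 4  # input dimension and number of heads (illustrative values)
torch.manual_seed(0)

# Target MHA to approximate, and a randomly initialized source MHA (kept frozen).
# For simplicity both use hidden dimension d; the theory requires the source's
# key/value hidden dimension to be larger, on the order of d log(H d^{3/2}).
target = nn.MultiheadAttention(embed_dim=d, num_heads=H, batch_first=True)
source = nn.MultiheadAttention(embed_dim=d, num_heads=H, batch_first=True)

# Hypothetical mask: a binary tensor per parameter. A real SLT would be found
# by pruning; here we simply draw the mask at random to show the mechanics.
masks = {name: (torch.rand_like(p) > 0.5).float()
         for name, p in source.named_parameters()}

with torch.no_grad():
    for name, p in source.named_parameters():
        p.mul_(masks[name])  # masked random weights form the subnetwork (SLT candidate)

x = torch.randn(8, 16, d)  # (batch, sequence, feature)
y_target, _ = target(x, x, x)
y_subnet, _ = source(x, x, x)
print("approximation error:", (y_target - y_subnet).norm().item())
```

Under the paper's scaling, one would expect this approximation error (for a well-chosen mask, not the random one above) to shrink exponentially as the source's hidden dimension grows.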