The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit
June 30, 2023
Authors: Lorenzo Noci, Chuning Li, Mufan Bill Li, Bobby He, Thomas Hofmann, Chris Maddison, Daniel M. Roy
cs.AI
Abstract
In deep learning theory, the covariance matrix of the representations serves as a proxy to examine the network's trainability. Motivated by the success of Transformers, we study the covariance matrix of a modified Softmax-based attention model with skip connections in the proportional limit of infinite-depth-and-width. We show that at initialization the limiting distribution can be described by a stochastic differential equation (SDE) indexed by the depth-to-width ratio. To achieve a well-defined stochastic limit, the Transformer's attention mechanism is modified by centering the Softmax output at identity, and scaling the Softmax logits by a width-dependent temperature parameter. We examine the stability of the network through the corresponding SDE, showing how the scale of both the drift and diffusion can be elegantly controlled with the aid of residual connections. The existence of a stable SDE implies that the covariance structure is well-behaved, even for very large depth and width, thus preventing the notorious issues of rank degeneracy in deep attention models. Finally, we show, through simulations, that the SDE provides a surprisingly good description of the corresponding finite-size model. We coin the name shaped Transformer for these architectural modifications.
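
To make the described modifications concrete, the sketch below implements one "shaped" attention layer at initialization in NumPy. It is only an illustration of the two ingredients named in the abstract: the Softmax output centered at identity (here by subtracting the uniform attention profile and adding the identity back) and a width-dependent temperature on the logits. The specific choices `tau = sqrt(width)`, the centering strength `gamma`, and the residual branch scale `alpha = 1/sqrt(depth)` are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def softmax(logits, axis=-1):
    # numerically stable row-wise softmax
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def shaped_attention(X, W_q, W_k, W_v, tau, gamma):
    """One 'shaped' attention head at initialization (illustrative sketch).

    - logits are divided by a width-dependent temperature `tau`
    - the Softmax output is centered at the identity: the uniform attention
      profile is subtracted, so the attention matrix is a perturbation of I
    """
    n, _ = X.shape
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    logits = Q @ K.T / tau                                  # temperature-scaled logits
    A = softmax(logits, axis=-1)
    A_shaped = np.eye(n) + gamma * (A - np.ones((n, n)) / n)
    return A_shaped @ V

def shaped_block(X, params, tau, gamma, alpha):
    """Residual update X + alpha * Attn(X); `alpha` controls how much drift
    and diffusion each layer contributes across depth (assumed scaling)."""
    return X + alpha * shaped_attention(X, *params, tau=tau, gamma=gamma)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, width, depth = 32, 256, 64                           # tokens, width, depth
    X = rng.standard_normal((n, width))
    tau = np.sqrt(width)                                    # width-dependent temperature (assumed form)
    gamma = 1.0                                             # strength of the centered Softmax term
    alpha = 1.0 / np.sqrt(depth)                            # residual branch scaling (assumed form)
    for _ in range(depth):
        params = tuple(rng.standard_normal((width, width)) / np.sqrt(width)
                       for _ in range(3))                   # W_q, W_k, W_v at initialization
        X = shaped_block(X, params, tau, gamma, alpha)
    # covariance (Gram) matrix of the token representations after `depth` layers
    C = X @ X.T / width
    print("diag mean:", C.diagonal().mean(),
          "off-diag mean:", (C.sum() - np.trace(C)) / (n * (n - 1)))
```

Tracking the printed diagonal and off-diagonal entries of `C` at a fixed depth-to-width ratio is a simple way to probe, in a finite-size model, the covariance behavior that the paper characterizes through the limiting SDE.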