The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit
June 30, 2023
Authors: Lorenzo Noci, Chuning Li, Mufan Bill Li, Bobby He, Thomas Hofmann, Chris Maddison, Daniel M. Roy
cs.AI
Abstract
In deep learning theory, the covariance matrix of the representations serves as a proxy to examine the network's trainability. Motivated by the success of Transformers, we study the covariance matrix of a modified Softmax-based attention model with skip connections in the proportional limit of infinite depth and width. We show that at initialization the limiting distribution can be described by a stochastic differential equation (SDE) indexed by the depth-to-width ratio. To achieve a well-defined stochastic limit, the Transformer's attention mechanism is modified by centering the Softmax output at the identity and scaling the Softmax logits by a width-dependent temperature parameter. We examine the stability of the network through the corresponding SDE, showing how the scale of both the drift and diffusion can be elegantly controlled with the aid of residual connections. The existence of a stable SDE implies that the covariance structure is well-behaved, even for very large depth and width, thus preventing the notorious issues of rank degeneracy in deep attention models. Finally, we show through simulations that the SDE provides a surprisingly good description of the corresponding finite-size model. We coin the name "shaped Transformer" for these architectural modifications.
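
The two modifications named in the abstract (centering the Softmax output at the identity and scaling the logits by a width-dependent temperature) can be illustrated with a short sketch. The code below is a minimal single-head toy implementation, not the paper's reference code: the weight names `W_q`, `W_k`, `W_v`, the square weight shapes, the uniform matrix used for centering, and the particular choice `tau = sqrt(d)` are illustrative assumptions.

```python
import numpy as np

def shaped_attention(X, W_q, W_k, W_v, tau):
    """Sketch of 'shaped' attention as described in the abstract:
    temperature-scaled logits and a Softmax output centered at the identity.
    Shapes, parameter names, and the centering term are assumptions."""
    T, d = X.shape                                   # sequence length, width
    logits = (X @ W_q) @ (X @ W_k).T / tau           # width-dependent temperature
    S = np.exp(logits - logits.max(axis=-1, keepdims=True))
    S = S / S.sum(axis=-1, keepdims=True)            # row-wise Softmax
    # Center at the identity: identity plus the fluctuation of S around the
    # uniform attention matrix (assumed here as the centering reference).
    A = np.eye(T) + (S - np.ones((T, T)) / T)
    return A @ (X @ W_v)

# Hypothetical usage with a temperature that grows with the width d.
d, T = 64, 16
rng = np.random.default_rng(0)
X = rng.standard_normal((T, d)) / np.sqrt(d)
W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
out = shaped_attention(X, W_q, W_k, W_v, tau=np.sqrt(d))
```

With a large temperature the Softmax rows are close to uniform, so the centered attention matrix stays close to the identity; combined with residual connections, this is the kind of near-identity update that keeps the covariance structure well-behaved in the abstract's depth-and-width limit.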