

Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer

May 25, 2023
Authors: Yuandong Tian, Yiping Wang, Beidi Chen, Simon Du
cs.AI

Abstract

The Transformer architecture has shown impressive performance in multiple research domains and has become the backbone of many neural network models. However, there is limited understanding of how it works. In particular, with a simple predictive loss, how the representation emerges from the gradient training dynamics remains a mystery. In this paper, for a 1-layer transformer with one self-attention layer plus one decoder layer, we analyze its SGD training dynamics for the task of next-token prediction in a mathematically rigorous manner. We open the black box of the dynamic process of how the self-attention layer combines input tokens, and reveal the nature of the underlying inductive bias. More specifically, under the assumptions that (a) there is no positional encoding, (b) the input sequence is long, and (c) the decoder layer learns faster than the self-attention layer, we prove that self-attention acts as a discriminative scanning algorithm: starting from uniform attention, it gradually attends more to distinct key tokens for a specific next token to be predicted, and pays less attention to common key tokens that occur across different next tokens. Among distinct tokens, it progressively drops attention weights, following the order of low to high co-occurrence between the key and the query token in the training set. Interestingly, this procedure does not lead to winner-takes-all, but decelerates due to a phase transition that is controllable by the learning rates of the two layers, leaving an (almost) fixed token combination. We verify this "scan and snap" dynamics on synthetic and real-world data (WikiText).
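To make the analyzed setup concrete, below is a minimal PyTorch sketch of the kind of model the abstract describes: one self-attention layer without positional encoding plus one decoder layer, trained with SGD on next-token prediction, where the decoder layer is given a larger learning rate than the attention layer (assumption (c)). This is an illustrative assumption-based sketch, not the authors' implementation; all hyperparameters, the exact parameterization, and the toy data are made up for illustration.

```python
# Minimal sketch (not the authors' code) of a 1-layer transformer:
# a single self-attention layer (no positional encoding) plus a decoder layer,
# trained with SGD on next-token prediction. Sizes and learning rates are illustrative.
import torch
import torch.nn as nn


class OneLayerTransformer(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)              # token embeddings, no positional encoding
        self.q = nn.Linear(d_model, d_model, bias=False)            # query projection
        self.k = nn.Linear(d_model, d_model, bias=False)            # key projection
        self.decoder = nn.Linear(d_model, vocab_size, bias=False)   # decoder layer -> next-token logits

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len); the last token serves as the query.
        x = self.embed(tokens)                                       # (batch, seq_len, d_model)
        q = self.q(x[:, -1:, :])                                     # query from the last token
        k = self.k(x)                                                # keys from all input tokens
        attn = torch.softmax(q @ k.transpose(-2, -1) / x.shape[-1] ** 0.5, dim=-1)
        ctx = attn @ x                                               # attention-weighted combination of input tokens
        return self.decoder(ctx.squeeze(1))                          # logits over the next token


model = OneLayerTransformer(vocab_size=100, d_model=64)

# Assumption (c): the decoder layer uses a larger learning rate than the self-attention layer.
opt = torch.optim.SGD([
    {"params": [*model.embed.parameters(), *model.q.parameters(), *model.k.parameters()], "lr": 1e-3},
    {"params": model.decoder.parameters(), "lr": 1e-1},
])
loss_fn = nn.CrossEntropyLoss()

# Toy batch: random input sequences and next tokens, purely for illustration.
tokens = torch.randint(0, 100, (8, 32))
next_tok = torch.randint(0, 100, (8,))

loss = loss_fn(model(tokens), next_tok)
opt.zero_grad()
loss.backward()
opt.step()
```

In this sketch the attention weights over the input tokens (`attn`) are the quantity whose training dynamics the paper tracks: starting from (near-)uniform attention and gradually concentrating on distinct key tokens while down-weighting common ones.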