Learning to Skip the Middle Layers of Transformers
June 26, 2025
Authors: Tim Lawson, Laurence Aitchison
cs.AI
Abstract
Conditional computation is a popular strategy to make Transformers more
efficient. Existing methods often target individual modules (e.g.,
mixture-of-experts layers) or skip layers independently of one another.
However, interpretability research has demonstrated that the middle layers of
Transformers exhibit greater redundancy, and that early layers aggregate
information into token positions. Guided by these insights, we propose a novel
architecture that dynamically skips a variable number of layers from the middle
outward. In particular, a learned gating mechanism determines whether to bypass
a symmetric span of central blocks based on the input, and a gated attention
mechanism prevents subsequent tokens from attending to skipped token positions.
Residual norms are controlled with a 'sandwich' or 'perilayernorm' scheme and
gate sparsity with an adaptive regularization loss. We had aimed to reduce
compute requirements for 'simpler' tokens and potentially foster an emergent
multi-level representational hierarchy but, at the scales investigated, our
approach does not achieve improvements in the trade-off between validation
cross-entropy and estimated FLOPs compared to dense baselines with fewer
layers. We release our code at https://github.com/tim-lawson/skip-middle.
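
The following is a minimal PyTorch sketch of the mechanism described above: a learned per-token gate scales the contribution of a span of central blocks, a gated attention bias hides near-skipped positions from later tokens, and a placeholder regularizer encourages gate sparsity. All names (GatedBlock, SkipMiddle, gate_proj, target_keep) are hypothetical and not taken from the authors' repository; the sketch also simplifies the method by using a single gate for the whole central span rather than nested gates that skip a variable number of layers from the middle outward, and it omits the 'sandwich'/'peri-layernorm' residual-norm control. The authors' actual implementation is available at https://github.com/tim-lawson/skip-middle.

```python
import torch
import torch.nn as nn


class GatedBlock(nn.Module):
    """A standard pre-norm Transformer block that accepts an additive
    attention bias, used to down-weight keys at skipped token positions."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor, attn_bias: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        # attn_bias: (batch * n_heads, seq, seq), added to attention logits.
        a, _ = self.attn(h, h, h, attn_mask=attn_bias, need_weights=False)
        x = x + a
        return x + self.mlp(self.norm2(x))


class SkipMiddle(nn.Module):
    """Outer blocks always run; a learned per-token gate in [0, 1] scales the
    contribution of a span of central blocks, softly 'skipping' them when the
    gate saturates at zero. Names and structure are illustrative only."""

    def __init__(self, d_model: int, n_heads: int, n_outer: int, n_middle: int):
        super().__init__()
        self.n_heads = n_heads
        self.early = nn.ModuleList(GatedBlock(d_model, n_heads) for _ in range(n_outer))
        self.middle = nn.ModuleList(GatedBlock(d_model, n_heads) for _ in range(n_middle))
        self.late = nn.ModuleList(GatedBlock(d_model, n_heads) for _ in range(n_outer))
        self.gate_proj = nn.Linear(d_model, 1)  # per-token gate logit

    def forward(self, x: torch.Tensor):
        b, s, _ = x.shape
        causal = torch.full((s, s), float("-inf"), device=x.device).triu(1)
        plain_bias = causal.expand(b * self.n_heads, s, s)

        for blk in self.early:
            x = blk(x, plain_bias)

        # Per-token gate: ~1 keeps the central span, ~0 bypasses it.
        gate = torch.sigmoid(self.gate_proj(x))  # (b, s, 1)

        # Gated attention: add log(gate) to the logits of each key position,
        # so later tokens cannot attend to positions whose gate is near zero.
        key_bias = torch.log(gate.clamp_min(1e-6)).transpose(1, 2)  # (b, 1, s)
        gated_bias = (causal + key_bias).repeat_interleave(self.n_heads, dim=0)

        h = x
        for blk in self.middle:
            h = blk(h, gated_bias)
        # Soft bypass of the whole central span. NOTE: in this sketch the
        # middle blocks still execute; FLOP savings require hard skipping.
        x = x + gate * (h - x)

        for blk in self.late:
            x = blk(x, plain_bias)

        # Sparsity regularizer (placeholder, not the paper's adaptive loss):
        # pull the mean gate toward a target fraction of kept tokens.
        target_keep = 0.5  # hypothetical target
        sparsity_loss = (gate.mean() - target_keep).pow(2)
        return x, sparsity_loss


if __name__ == "__main__":
    model = SkipMiddle(d_model=64, n_heads=4, n_outer=2, n_middle=4)
    y, reg = model(torch.randn(2, 16, 64))
    print(y.shape, reg.item())
```

In this soft form the central blocks still execute for every token; the intended compute savings would come from thresholding the gate at inference so that tokens with a zero gate bypass the central blocks entirely and are excluded as attention keys.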